(12) Using the JIT Engine

Last updated: 2022-04-01 14:36:23

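In just a few years, the LLVM platform has changed the direction of many programming languages and given rise to a large number of distinctive new ones. It deserves its reputation as the king of compiler architectures, and it won the 2012 ACM Software System Award. (Epigraph)

Copyright notice: this is an original article by 西风逍遥游. When reposting, please credit the source, 西风世界 [http://blog.csdn.net/xfxyy_sxfancy](http://blog.csdn.net/xfxyy_sxfancy)

# Using the JIT Engine

LLVM was designed from the start with interpreted execution in mind. That makes it well suited for use as a cross-platform intermediate bytecode: it runs conveniently on many platforms while keeping the advantages of a compiled language. The LLVM 3.6 release used here removed the original JIT and replaced it with the newer MCJIT, which brings a noticeable performance improvement. This article briefly introduces how to use MCJIT and what to watch out for.

### JIT technology

A Just-In-Time compiler compiles intermediate code dynamically: machine code is generated and executed on demand while the program runs, which can greatly speed up dynamic languages. Java, the .NET platform, LuaJIT and others rely heavily on JIT technology and reach very high execution efficiency, gradually approaching the performance of native machine code.

The principle behind a JIT engine is not that complicated. In essence, the machine code that a compiler would normally write to a file is written directly into the current process's memory, and then a function-pointer cast is used to locate and execute it. In practice there are many painful details to handle, such as memory management, symbol relocation and resolving external symbols. This amounts to dealing with much of the complexity of a compiler back end, so designing a usable JIT engine from scratch is still very hard.

### What you can build with LLVM's MCJIT

The basic use is as the low-level tool underneath an interpreter that executes LLVM bytecode. More concretely, you could build a cross-platform C++ plugin system: compile C/C++ code once to `.bc` bytecode with clang, then run it on each platform. You could also build a cloud debugging system that registers methods remotely over the network and collects debug information from a C++ client. There are many other uses waiting to be explored.

### Building an interpreter with MCJIT

Writing an interpreter for LLVM bytecode is quite simple. The best example is probably the lli tool in the LLVM source tree: about 700 lines of C++ that call into the LLVM libraries to implement a JIT engine for LLVM bytecode. If you want to study LLVM's interpreter and JIT properly, see its [source on github](https://github.com/llvm-mirror/llvm/blob/master/tools/lli/lli.cpp).

### Initializing the system

To use LLVM's JIT you need a few initialization calls, which can go at the start of main.

~~~
InitializeNativeTarget();
InitializeNativeTargetAsmPrinter();
InitializeNativeTargetAsmParser();
~~~

These calls mainly set up the TargetMachine used by the JIT, initializing the machine-specific compilation target.

### Including the relevant headers

The list here contains a few more headers than strictly necessary; never mind that. LLVM's headers are organized hierarchically: everything for the execution engine is under `llvm/ExecutionEngine/`, and everything related to IR is under `llvm/IR/`. Newcomers often cannot tell which headers they need, so it pays to read the documentation and learn what each LLVM module does.

~~~
#include "llvm/ExecutionEngine/GenericValue.h"
#include "llvm/ExecutionEngine/MCJIT.h"
#include "llvm/ExecutionEngine/Interpreter.h"
#include "llvm/ExecutionEngine/SectionMemoryManager.h"
#include "llvm/IR/Verifier.h"
#include "llvm/IR/Constants.h"
#include "llvm/IR/DerivedTypes.h"
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Instructions.h"
#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/Module.h"
#include <llvm/IRReader/IRReader.h>
#include <llvm/Support/SourceMgr.h>
#include "llvm/Support/ManagedStatic.h"
#include "llvm/Support/TargetSelect.h"
#include <llvm/Support/MemoryBuffer.h>
#include "llvm/Support/raw_ostream.h"
#include <llvm/Support/DynamicLibrary.h>
#include "llvm/Support/Debug.h"
~~~

The detail that matters most concerns these two:

~~~
#include "llvm/ExecutionEngine/MCJIT.h"
#include "llvm/ExecutionEngine/Interpreter.h"
~~~

Surprisingly, neither header is required for compilation: if you leave them out, the compiler reports nothing. The execution engine is exposed as an interface and hides the details of its subclasses, so you must include one or both of these headers yourself. Otherwise the corresponding engine is not linked in, and you get the following error at run time:

~~~
Create Engine Error
JIT has not been linked in.
~~~

![Class structure](https://docs.gechiui.com/gc-content/uploads/sites/kancloud/2016-06-03_5750ee1bb63a4.jpg "")

### Building the JIT engine with EngineBuilder

We do not need more than one JIT engine, so we use a singleton here and initialize it with an LLVM Module. If the engine already exists, we call addModule to add the Module to the engine's set of modules.

finalizeObject is a key function for the JIT engine: we must make sure it has been called before we call any JIT-compiled code.

~~~
ExecutionEngine* EE = NULL;
RTDyldMemoryManager* RTDyldMM = NULL;

void initEE(std::unique_ptr<Module> Owner) {
    string ErrStr;
    if (EE == NULL) {
        RTDyldMM = new SectionMemoryManager();
        EE = EngineBuilder(std::move(Owner))
            .setEngineKind(EngineKind::JIT)
            .setErrorStr(&ErrStr)
            .setVerifyModules(true)
            .setMCJITMemoryManager(std::unique_ptr<RTDyldMemoryManager>(RTDyldMM))
            .setOptLevel(CodeGenOpt::Default)
            .create();
    } else
        EE->addModule(std::move(Owner));
    if (ErrStr.length() != 0)
        cerr << "Create Engine Error" << endl << ErrStr << endl;
    if (EE == NULL) return;   // creation failed, so there is nothing to finalize
    EE->finalizeObject();
}
~~~

This is what the documentation says about `finalizeObject`:

finalizeObject - ensure the module is fully processed and is usable. It is the user-level function for completing the process of making the object usable for execution. It should be called after sections within an object have been relocated using mapSectionAddress. When this method is called the MCJIT execution engine will reapply relocations for a loaded object. This method has no effect for the interpreter.

`setEngineKind` accepts `JIT` or `Interpreter`. If you leave it at the default, `JIT` is preferred and whichever engine is available gets used.

`setMCJITMemoryManager` sets a key component. It appears that a memory manager is created by default even if you omit this call, but we add it here for clarity. The memory manager plays an important role in the execution engine. For an ordinary local application we choose the `SectionMemoryManager` class, while lli even contains classes for remote execution.

`setOptLevel` sets the optimization level. The default corresponds to `O2`, and it can be changed to one of the following enum values:

- None
- Less
- Default
- Aggressive

MCJIT architecture diagram

![MCJIT architecture diagram](https://docs.gechiui.com/gc-content/uploads/sites/kancloud/2016-06-03_5750ee1bc8fc2.jpg "")

### Writing the core call routine

~~~
typedef void (*func_type)(void*);

// path is the path of the .bc file; func_name is the name of the function to run
void Run(const std::string& path, const std::string& func_name) {
    // First read the bitcode to execute.
    // context is an LLVMContext that is not shown in this listing.
    SMDiagnostic error;
    std::unique_ptr<Module> Owner = parseIRFile(path, error, context);
    if (Owner == nullptr) {
        cout << "Load Error: " << path << endl;
        // Owner is null here, so print the parser diagnostic instead of dumping the module
        error.print(path.c_str(), errs());
        return;
    }

    // Singleton-style initialization; multithreading is not considered yet
    initEE(std::move(Owner));

    // Get the compiled function pointer and execute it
    uint64_t func_addr = EE->getFunctionAddress(func_name.c_str());
    if (func_addr == 0) {
        printf("Error: function not found: %s\n", func_name.c_str());
        return;
    }
    func_type func = (func_type) func_addr;
    func(NULL); // arguments, if needed, are passed here
}
~~~

### The interpreter version

The interpreter is somewhat slower, but it can load and execute code lazily, which is sometimes useful. Building on the JIT version, here is a brief look at the interpreter. The main change is how the engine is created:

~~~
EE = EngineBuilder(std::move(Owner))
    // changed to the interpreter here
    .setEngineKind(EngineKind::Interpreter)
    .setErrorStr(&ErrStr)
    .setVerifyModules(true)
    .setMCJITMemoryManager(std::unique_ptr<RTDyldMemoryManager>(RTDyldMM))
    .setOptLevel(CodeGenOpt::Default)
    .create();
~~~

With the interpreter you can also replace `parseIRFile` with `getLazyIRFileModule` to load the `.bc` file lazily. Execution differs a little from the JIT: you use FindFunctionNamed to look up the function object. The interpreter has access to more of the intermediate information in the LLVM bytecode, such as attributes and metadata, which is very useful when building a flexible interpreter for a dynamic language.

~~~
// Interpreter-specific part
Function* func = EE->FindFunctionNamed(func_name.c_str());
if (func == NULL) {
    printf("Skipped: function not found: %s\n", func_name.c_str());
    return;
}
// If arguments need to be passed
std::vector<GenericValue> args;
args.push_back(GenericValue(NULL));
EE->runFunction(func, args);
~~~

### Creating C code for testing

I developed this inside the Elite compiler project, so my test calls one of its interfaces. You can write any simple C function to test the call:

~~~
extern void test2_elite_plugin_init(CodeGenContext* context) {
    printf("test2_elite_plugin_init\n");
    if (context == NULL) printf("Error for context\n");
    else context->AddOrReplaceMacros(macro_funcs);
}
~~~

Result:

![Execution result](https://docs.gechiui.com/gc-content/uploads/sites/kancloud/2016-06-03_5750ee1bde63d.jpg "")

Most of the LLVM techniques I have been studying recently are applied in the ongoing development of the Elite compiler. You are welcome to follow along and take part.

github: [https://github.com/elite-lang/Elite](https://github.com/elite-lang/Elite)

Documentation: [http://elite-lang.org/doc/zh-cn/](http://elite-lang.org/doc/zh-cn/)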
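Returning to the call step of this article (look up an address by name, check it against 0, cast it to a function pointer, call it), the pattern can be tried without LLVM at all. The sketch below is only an illustration: `FakeEngine`, `demo_plugin_init` and `call_by_name` are hypothetical names, and the "engine" is just a map of ordinary functions standing in for JIT-emitted code.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Same signature the article uses for the functions it calls.
typedef void (*func_type)(void*);

// Stand-in for code a JIT would have emitted (hypothetical).
static int g_calls = 0;
static void demo_plugin_init(void* arg) { if (arg == nullptr) ++g_calls; }

// Hypothetical stand-in for an engine's address lookup:
// it returns 0 when the symbol is unknown, like the check in Run().
class FakeEngine {
public:
    void add(const std::string& name, func_type f) {
        table_[name] = reinterpret_cast<std::uintptr_t>(f);
    }
    std::uint64_t getFunctionAddress(const std::string& name) const {
        auto it = table_.find(name);
        return it == table_.end() ? 0 : static_cast<std::uint64_t>(it->second);
    }
private:
    std::map<std::string, std::uintptr_t> table_;
};

// Look up a symbol and call it; returns false if the symbol is missing.
bool call_by_name(const FakeEngine& ee, const std::string& name, void* arg) {
    std::uint64_t addr = ee.getFunctionAddress(name);
    if (addr == 0) return false;
    // The integer address is turned back into a typed function pointer.
    func_type f = reinterpret_cast<func_type>(static_cast<std::uintptr_t>(addr));
    f(arg);
    return true;
}
```

The cast is only safe when the pointer type matches the real signature of the target function, which is why the article fixes a single `func_type` for every entry point it calls.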


(11) Understanding GetElementPtr in Depth

Last updated: 2022-04-01 14:36:21

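# Understanding GetElementPtr in Depth

LLVM is very similar to C: it is strongly typed, needs intricate pointer operations, and calls through system-level symbols. LLVM's pointer instruction, GetElementPtr, is at the heart of almost all pointer computation, so understanding how it works and using it correctly is very important.

### LLVM is strongly typed

When writing LLVM code, always remember that LLVM is strongly typed. Every instruction has a definite type, and GetElementPtr is no exception: different arguments give different result types. Let's look at an example from the LLVM website:

~~~
struct munger_struct {
  int f1;
  int f2;
};
void munge(struct munger_struct *P) {
  P[0].f1 = P[1].f1 + P[2].f2;
}
...
munger_struct Array[3];
...
munge(Array);
~~~

Compiling this as C with Clang turns the `munge` function into the following IR:

~~~
void %munge(%struct.munger_struct* %P) {
entry:
  %tmp = getelementptr %struct.munger_struct* %P, i32 1, i32 0
  %tmp1 = load i32* %tmp
  %tmp6 = getelementptr %struct.munger_struct* %P, i32 2, i32 1
  %tmp7 = load i32* %tmp6
  %tmp8 = add i32 %tmp7, %tmp1
  %tmp9 = getelementptr %struct.munger_struct* %P, i32 0, i32 0
  store i32 %tmp8, i32* %tmp9
  ret void
}
~~~

Look closely: every instruction states explicitly that `P` has type `%struct.munger_struct*`, and the load instructions that follow show indirectly that the result type is `i32*`. To understand how GetElementPtr works you must keep track of the types involved at every step, or you will go astray.

### The rules of the GetElementPtr instruction

GetElementPtr is purely an address computation. It does not read or modify any data; it only computes a pointer and gives the result the appropriate pointer type.

GetElementPtr takes at least two arguments. The first is the base pointer to compute from, usually a pointer to a struct or to the first element of an array. The second and all following arguments are called `indices` and say what to select, such as which array element or which struct member. Let's see how this works on the example:

~~~
P[0].f1
~~~

This is the assignment target in the sample code. Our C experience tells us that the address of `P[0]` is the start of the array, and `f1` is the first member of the struct, so the address we finally store to is numerically the same as P. The address computation corresponds to this instruction:

~~~
%tmp9 = getelementptr %struct.munger_struct* %P, i32 0, i32 0
~~~

Both indices are 0, but they mean different things. The first 0 is an array index and does not change the result type: any pointer can be treated as an array and stepped through, which is why this 0 cannot be omitted. The second 0 is a struct index and selects member 0 of the struct. The offset is computed from the member layout of the struct type, and the result is a pointer to that member. In the same way, compare these two expressions:

~~~
P[1].f1
P[2].f2
~~~

They translate to:

~~~
%tmp = getelementptr %struct.munger_struct* %P, i32 1, i32 0
%tmp6 = getelementptr %struct.munger_struct* %P, i32 2, i32 1
~~~

### Caveats

First, not every index has to be i32; i64 is allowed too. Struct indices, however (the second number in the examples above), must be i32.

GEP x,1,0,0 and GEP x,1 compute the same address but have different types, so be careful never to append extra zeros to the instruction.

### Other cases

### Array indexing only

If only array pointer arithmetic is involved, things are much simpler: moving an array pointer takes a single index. If you have only a struct pointer, you still need two indices.

### Multidimensional arrays

Personally I find LLVM's array definitions hard to write and recommend using a one-dimensional array instead; the index computation is not complicated after all. Once higher-dimensional arrays are flattened to one dimension, everything becomes basic pointer arithmetic and is very simple.

### Chained selection

GetElementPtr accepts practically any number of indices and can select through several levels in one instruction, which lets us implement chained accesses such as:

~~~
A->B->C
~~~

I do not recommend doing this, though. In syntax-directed compilation in particular it is hard to achieve, so I suggest emitting one instruction per step and selecting one level at a time.

### References

[http://llvm.org/docs/GetElementPtr.html](http://llvm.org/docs/GetElementPtr.html)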
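To make the index rules of this article concrete, the address that a two-index GetElementPtr computes for `munger_struct` can be reproduced with plain C++ pointer arithmetic. This is only a sketch under stated assumptions: `gep_offset` and `gep` are hypothetical helper names, and the offsets come from the host compiler's struct layout, not from LLVM.

```cpp
#include <cassert>
#include <cstddef>

struct munger_struct { int f1; int f2; };

// Byte offset that a GEP with indices (array_index, field_index) would add to P:
// the first index steps over whole structs, the second selects a member.
std::size_t gep_offset(std::size_t array_index, unsigned field_index) {
    assert(field_index < 2);
    std::size_t field_off = (field_index == 0) ? offsetof(munger_struct, f1)
                                               : offsetof(munger_struct, f2);
    return array_index * sizeof(munger_struct) + field_off;
}

// Apply the offset to a base pointer. No memory is read here, just as
// GetElementPtr itself never accesses memory; the result is an int*,
// matching the i32* result type in the IR.
int* gep(munger_struct* base, std::size_t array_index, unsigned field_index) {
    char* raw = reinterpret_cast<char*>(base) + gep_offset(array_index, field_index);
    return reinterpret_cast<int*>(raw);
}
```

With `munger_struct Array[3]`, `gep(Array, 1, 0)` gives the same pointer as `&Array[1].f1`, and `gep(Array, 0, 0)` is numerically equal to `Array` itself, which is the "two zeros" case discussed in the article.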


(10) Storing and Loading Variables

Last updated: 2022-04-01 14:36:19

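# Storing and Loading Variables

Variables are at the core of a programming language, and there is some truth in saying that a compiler is a symbol-processing tool. A stack-based symbol table is a convenient way to record the variables and syntactic symbols seen during compilation, and the previous section showed how to implement one. Is there another, simpler way to store and load variables?

### LLVM's built-in symbol table

LLVM provides an internal symbol table of its own. It differs from ours: its symbols are scoped by function, with local symbols inside a function and global symbols outside. It exists mainly so that LLVM can find its own low-level elements, so its capabilities are limited. Take this bytecode, for example:

~~~
define void @print(i64 %k1) {
entry:
    ...
}
~~~

Through the symbol table we can find the element k1. Getting hold of the table is easy too: as long as you have a basic block, you can get a pointer to it:

~~~
BasicBlock* bb = context->getNowBlock();
ValueSymbolTable* st = bb->getValueSymbolTable();
Value* v = st->lookup(value);
~~~

### Allocating stack space for variables: AllocaInst

AllocaInst is a standard LLVM instruction that allocates space on the stack. You do not need to manage stack growth yourself; the instruction does it for you and returns a pointer to the allocated space. Do not mistake it for dynamic heap allocation: heap memory is obtained by calling a function such as malloc.

~~~
%k = alloca i64
~~~

After this instruction, k has the type "pointer to the allocated type", here `i64*`. The C++ interface is very easy to use:

~~~
AllocaInst *alloc = new AllocaInst(t, var_name, context->getNowBlock());
~~~

t is the type to allocate, var_name is the name of the value the instruction returns (the 'k' above), and the last argument is the basic block to insert into. The returned instruction then represents the pointer k.

### Storing a variable

In LLVM, storing a variable always requires a pointer to the destination address. Note that it must be a pointer, not a value. Prototype:

~~~
StoreInst (Value *Val, Value *Ptr, bool isVolatile, BasicBlock *InsertAtEnd)
~~~

Example:

~~~
new StoreInst(value2, value1, false, context->getNowBlock());
~~~

value1 is the destination pointer and value2 is the value to store. false means the store is not volatile. This parameter corresponds to the volatile keyword in C and mainly prevents the compiler from optimizing away repeated accesses. An optimizer normally assumes that several loads of a variable with no intervening change yield the same value, which does not hold with multiple threads or hardware interrupts.

### Loading a variable

To read a variable, use the load instruction:

~~~
LoadInst (Value *Ptr, const Twine &NameStr, bool isVolatile, unsigned Align, BasicBlock *InsertAtEnd)
~~~

Example (this uses the overload without the alignment parameter):

~~~
new LoadInst(v, "", false, bb);
~~~

We ignore memory alignment for now. Clang normally emits an explicit alignment that depends on the type, for example 4 bytes for an `i32`. Notice that load also reads through a pointer, and what it returns is a value.

### Building an assignment statement

Assignment is an awkward statement: the left side must be a pointer (an address to store to), while the right side must be a value that has been obtained. Most of what we have done so far, such as arithmetic and function calls, works on values. So we first implement value retrieval for variables. Because it is so general, it goes into code generation for the IDNode node:

~~~
Value* IDNode::codeGen(CodeGenContext* context) {
    BasicBlock* bb = context->getNowBlock();
    ValueSymbolTable* st = bb->getValueSymbolTable();
    Value* v = st->lookup(value);
    if (v == NULL || v->hasName() == false) {
        errs() << "undeclared variable " << value << "\n";
        return NULL;
    }
    Value* load = new LoadInst(v, "", false, bb);
    return load;
}
~~~

value is a member variable of our class that holds the variable name. Assignment, however, sometimes needs the pointer rather than the value, so we extend the function to return the symbol's pointer for that case:

~~~
Value* IDNode::codeGen(CodeGenContext* context) {
    BasicBlock* bb = context->getNowBlock();
    ValueSymbolTable* st = bb->getValueSymbolTable();
    Value* v = st->lookup(value);
    if (v == NULL || v->hasName() == false) {
        errs() << "undeclared variable " << value << "\n";
        return NULL;
    }
    // The context class records whether we are currently storing or loading
    if (context->isSave()) return v;
    Value* load = new LoadInst(v, "", false, bb);
    return load;
}
~~~

At the call site we then only need to do this:

~~~
static Value* opt2_macro(CodeGenContext* context, Node* node) {
    std::string opt = node->getStr();
    Node* op1 = (node = node->getNext());
    if (node == NULL) return NULL;
    Node* op2 = (node = node->getNext());
    if (node == NULL) return NULL;
    if (opt == "=") {
        // These two setIsSave calls make the node below return
        // the pointer instead of the loaded value
        context->setIsSave(true);
        Value* ans1 = op1->codeGen(context);
        context->setIsSave(false);
        Value* ans2 = op2->codeGen(context);
        return new StoreInst(ans2, ans1, false, context->getNowBlock());
    }
    ...
}
~~~

We could have written a separate function for this, but the two would be so similar that I did not want to add a near-duplicate. This will do for now; once the overall structure is more complete there should be a better way to implement it.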
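The "address on the left, value on the right" idea behind the save flag can be tried out in a few lines of ordinary C++. The sketch below is not LLVM code: `ToyContext`, `Operand`, `declare`, `ident` and `assign` are hypothetical names, and each variable's slot is simply a map entry playing the role of an alloca.

```cpp
#include <cassert>
#include <map>
#include <string>

// An identifier yields either the slot's address or its current value.
struct Operand {
    long* addr;   // non-null when the operand is an address (store target)
    long  value;  // meaningful when addr is null (a loaded value)
};

class ToyContext {
public:
    void declare(const std::string& name) { slots_[name] = 0; }  // plays the role of alloca
    bool isSave() const { return save_; }
    void setIsSave(bool s) { save_ = s; }

    // Mirrors the identifier code generation: address in "save" mode, value otherwise.
    Operand ident(const std::string& name) {
        auto it = slots_.find(name);
        assert(it != slots_.end() && "undeclared variable");
        if (save_) return Operand{&it->second, 0};
        return Operand{nullptr, it->second};           // the "load"
    }

    // Mirrors the "=" branch: left side as an address, right side as a value.
    long assign(const std::string& lhs, long rhs_value) {
        setIsSave(true);
        Operand target = ident(lhs);
        setIsSave(false);                              // always reset before anything else runs
        *target.addr = rhs_value;                      // the "store"
        return rhs_value;
    }
private:
    std::map<std::string, long> slots_;
    bool save_ = false;
};
```

A mode flag like this is easy to leave set by accident, so resetting it immediately after the left-hand side is generated, as the article's code does, is the important part of the pattern.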


(9) Building a Stack-Based Symbol Table

Last updated: 2022-04-01 14:36:16

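# Building a Stack-Based Symbol Table

A stack-based symbol table is without doubt a core component of a compiler. Whatever kind of symbol scanning you do, you cannot do without a symbol table: the only way to know whether a symbol is defined, and what its type is, is to look at its record there. A stack-based symbol table is not complicated, but the idea is elegant. This article explains the principle and builds a simple one.

### A source code example

Let's start with a piece of C code:

~~~
int a[3] = { 100, 10, 1};
int work() {
    if (a[0] == 100) { // a here refers to the global symbol a
        int a = 99999; // redefines a as a local symbol; the figure below shows the table after scanning up to here
        for (int i = 0; i< 10; ++i) {
            a /= 3; // local symbols have higher priority, so this is the local a
        }
        return a; // local symbol
    }
    return a[0]; // the local symbol's lifetime has ended, so the global one is found
}
~~~

After a local variable is declared, the symbol table gains a local symbol. This does not conflict with the global symbol; the two merely differ in priority. The closer a symbol is to the top of the stack, the earlier it is found.

![Stack-based symbol table](https://docs.gechiui.com/gc-content/uploads/sites/kancloud/2016-06-03_5750ee1b9c1ed.jpg "")

### Building the symbol table with C++ map and stack

If efficiency were the priority, a hand-written table in C might be faster to operate on. For now we care more about ease of development, and a C++ map makes the symbol table easy to implement. We start with a local symbol table. It never contains duplicate names, so we simply store the entries. The table also has to record a lot of type and pointer information, so we design a struct for it:

~~~
enum SymbolType {
    var_t, type_t, struct_t, enum_t, delegate_t, function_t
};

struct id {
    int level;
    SymbolType type;
    void* data;
};
~~~

We keep it simple for now. We do not yet know everything that might be stored: an array symbol must carry its length and dimensions, every variable carries type information, and so on. So we use a void* pointer and cast it when needed. Defining a base class and deriving polymorphic subclasses for the stored content would work too. I did not do that mainly because many kinds of things may be stored, so a void* seems more convenient for the moment. That gives us the local symbol table:

~~~
class IDMap {
public:
    IDMap();
    ~IDMap();
    id* find(string& str) const; // look up a symbol
    void insert(string& str, int level, SymbolType type, void* data); // insert a symbol
private:
    map<string,id*> ID_map;
};
~~~

Lookup and insertion are basic map operations in C++, so implementing them should be easy. Next we add a stack that holds IDMap objects:

~~~
class IDTable {
public:
    IDTable();
    id* find(string& str) const;
    void insert(string& str,SymbolType type, void* data);
    void push(); // push and pop, e.g. push when parsing a function starts and pop when it is finished
    void pop();
    int getLevel(); // current depth; 0 means only global symbols remain
private:
    deque<IDMap> ID_stack;
};
~~~

deque is used instead of stack because deque can be iterated and accessed at any position, whereas stack only exposes its top. Lookup walks from the top of the stack to the bottom:

~~~
id* IDTable::find(string& idname) const {
    for (auto p = ID_stack.rbegin(); p != ID_stack.rend(); ++p) {
        const IDMap& imap = *p;
        id* pid = imap.find(idname);
        if (pid != NULL) return pid;
    }
    return NULL;
}
~~~

Insertion goes into the newest table, at the top of the stack:

~~~
void IDTable::insert(string& str,SymbolType type, void* data) {
    IDMap& imap = ID_stack.back();
    imap.insert(str,getLevel(), type, data);
}
~~~

With that, a simple stack-based symbol table is complete.

### Appendix 1: reference source on Github

[idmap.h](https://github.com/sunxfancy/RedApple/blob/master/src/idmap.h)
[idmap.cpp](https://github.com/sunxfancy/RedApple/blob/master/src/idmap.cpp)
[idtable.h](https://github.com/sunxfancy/RedApple/blob/master/src/idtable.h)
[idtable.cpp](https://github.com/sunxfancy/RedApple/blob/master/src/idtable.cpp)

### Appendix 2: Graphviz source for the figure

Drawing with Graphviz is a real pleasure. The data-structure figure above was drawn with its dot tool. If you are interested, see my earlier article [结构化图形绘制利器Graphviz](http://blog.csdn.net/xfxyy_sxfancy/article/details/49641825):

~~~
digraph g {
    graph [ rankdir = "LR" ];
    node [ fontsize = "16" shape = "ellipse" ];
    edge [ ];
    "node0" [ label = "<f0> stack | <f1> | <f2> | ..." shape = "record" ];
    "node1" [ label = "<f0> global symbols | a | work | | ..." shape = "record" ]
    "node2" [ label = "<f0> local symbols | a | | ..." shape = "record" ]
    "node0":f1 -> "node1":f0 [ id = 0 ];
    "node0":f2 -> "node2":f0 [ id = 1 ];
}
~~~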
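For readers who want to run the idea of this article end to end, here is a compact, self-contained variant of the stack-based table. It is a sketch, not the RedApple code: `ScopedTable` and `Symbol` are hypothetical names, symbols are stored by value instead of through `id*`, and the `void* data` field is left out.

```cpp
#include <cassert>
#include <deque>
#include <map>
#include <string>

enum SymbolType { var_t, type_t, function_t };

struct Symbol {
    int level;        // scope depth at which the symbol was declared
    SymbolType type;
};

class ScopedTable {
public:
    ScopedTable() { push(); }                    // level 0 holds the globals
    void push() { scopes_.emplace_back(); }      // enter a block or function
    void pop()  { assert(scopes_.size() > 1); scopes_.pop_back(); }
    int  getLevel() const { return static_cast<int>(scopes_.size()) - 1; }

    // Insert into the innermost (top-of-stack) scope.
    void insert(const std::string& name, SymbolType type) {
        scopes_.back()[name] = Symbol{getLevel(), type};
    }

    // Search from the innermost scope outwards; nullptr means "not declared".
    const Symbol* find(const std::string& name) const {
        for (auto s = scopes_.rbegin(); s != scopes_.rend(); ++s) {
            auto it = s->find(name);
            if (it != s->end()) return &it->second;
        }
        return nullptr;
    }
private:
    std::deque<std::map<std::string, Symbol>> scopes_;
};
```

Replaying the C example from the article: inserting a global `a`, calling `push()`, and inserting a local `a` makes `find("a")` return the level-1 entry; after `pop()` the same lookup returns the global again.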


(8) Function Calls and Basic Operators

Last updated: 2022-04-01 14:36:14

# Function Calls and Basic Operators

Earlier we covered function definitions. How do we call a function once it is defined? Today we look at function calls.

### The macro form of a function call

Recall how a function call was translated into the syntax tree:

~~~
printf("%d\n", y);
~~~

becomes:

~~~
Node
  String call
  String printf
  String %d\n
  ID y
~~~

The macro is named call and takes a variable number of arguments:

~~~
(call function-name argument-list... )
~~~

So we need to scan the argument list to collect all the call arguments.

### The basic form of the call macro

The call macro is simple: loop over the nodes to see how many arguments there are.

~~~
static Value* call_macro(CodeGenContext* context, Node* node) {
    // Argument 1: the function name
    // Remaining arguments: the arguments to pass
    for (Node* p = node->getNext(); p != NULL; p = p->getNext()) {
        // loop over the arguments
    }
}
~~~

Looking up the CallInst instruction in the LLVM documentation (all LLVM instructions derive from Instruction), we find that constructing one is simple:

~~~
static CallInst * Create (Value *Func, ArrayRef< Value * > Args, const Twine &NameStr, BasicBlock *InsertAtEnd)
~~~

The Value passed as Func has to be a Function object, though. Where do we get it? From the symbol table, of course. The symbol table implementation comes in the next article; as in the previous section, we just write the interface for now.

~~~
// Argument 1: the function name
Value* func = context->getFunction(node);
if (func == NULL) {
    errs() << "Function definition not found: ";
    errs() << node->getStr().c_str() << "\n";
    exit(1);
}
~~~

If the function has not been scanned yet when the call is generated, generation fails with "definition not found". This is why we use multiple passes: only by scanning the syntax tree several times can we see functions that are defined later in the source. Requiring forward declarations as C does would work too, but I do not like that style.

Collecting the arguments is easy, with one caveat: arguments are generated recursively. For example:

~~~
printf("%d", add(3, 5));
~~~

Here one of the arguments is itself an expression, so it must be processed first:

~~~
// Remaining arguments: the arguments to pass
std::vector<Value*> args;
for (Node* p = node->getNext(); p != NULL; p = p->getNext()) {
    Value* v = p->codeGen(context); // generate the argument recursively
    if (v != NULL)
        args.push_back(v);
}
~~~

The Node class implements a codeGen method that runs the complete code generation for the current node again, which makes recursion convenient:

~~~
Value* Node::codeGen(CodeGenContext* context) {
    return context->MacroMake(this); // MacroMake is the core code generation interface
}
~~~

Having generated this code recursively, we can emit a call instruction. Do not forget to return it to the caller:

~~~
static Value* call_macro(CodeGenContext* context, Node* node) {
    // Argument 1: the function name
    Value* func = context->getFunction(node);
    if (func == NULL) {
        errs() << "Function definition not found: ";
        errs() << node->getStr().c_str() << "\n";
        exit(1);
    }

    // Remaining arguments: the arguments to pass
    std::vector<Value*> args;
    for (Node* p = node->getNext(); p != NULL; p = p->getNext()) {
        Value* v = p->codeGen(context);
        if (v != NULL)
            args.push_back(v);
    }

    CallInst *call = CallInst::Create(func, args, "", context->getNowBlock());
    return call;
}
~~~

### Simple operators

To the machine, addition, subtraction, multiplication and division are just a few instructions. A compiler has to distinguish several cases, because the full list of operators is this long:

~~~
// Standard binary operators...
 FIRST_BINARY_INST( 8)
HANDLE_BINARY_INST( 8, Add  , BinaryOperator)
HANDLE_BINARY_INST( 9, FAdd , BinaryOperator)
HANDLE_BINARY_INST(10, Sub  , BinaryOperator)
HANDLE_BINARY_INST(11, FSub , BinaryOperator)
HANDLE_BINARY_INST(12, Mul  , BinaryOperator)
HANDLE_BINARY_INST(13, FMul , BinaryOperator)
HANDLE_BINARY_INST(14, UDiv , BinaryOperator)
HANDLE_BINARY_INST(15, SDiv , BinaryOperator)
HANDLE_BINARY_INST(16, FDiv , BinaryOperator)
HANDLE_BINARY_INST(17, URem , BinaryOperator)
HANDLE_BINARY_INST(18, SRem , BinaryOperator)
HANDLE_BINARY_INST(19, FRem , BinaryOperator)

// Logical operators (integer operands)
HANDLE_BINARY_INST(20, Shl  , BinaryOperator) // Shift left  (logical)
HANDLE_BINARY_INST(21, LShr , BinaryOperator) // Shift right (logical)
HANDLE_BINARY_INST(22, AShr , BinaryOperator) // Shift right (arithmetic)
HANDLE_BINARY_INST(23, And  , BinaryOperator)
HANDLE_BINARY_INST(24, Or   , BinaryOperator)
HANDLE_BINARY_INST(25, Xor  , BinaryOperator)
~~~

These definitions are hard to find. They are not spelled out in the documentation; they live in the header `llvm/IR/Instruction.def`, a file dedicated to macro definitions. And these are only the arithmetic and bitwise operators, not the comparisons. The variety follows from computer architecture: floating-point arithmetic necessarily differs from integer arithmetic, and right shifts come in arithmetic and logical flavors, so a large number of distinct operators is unavoidable. Creating the instruction is simple:

~~~
static BinaryOperator * Create (BinaryOps Op, Value *S1, Value *S2, const Twine &Name, BasicBlock *InsertAtEnd)
~~~

The two operands may be constants, values loaded from variables, or results of other expressions. All that matters is that the types returned by getType on the two Values satisfy the rules of the operation. Note that a floating-point number cannot be combined directly with an integer; the integer must be converted to floating point first.

Below are the simple operators. I have only written the integer operations:

~~~
static Value* opt2_macro(CodeGenContext* context, Node* node) {
    std::string opt = node->getStr();
    Node* op1 = (node = node->getNext());
    if (node == NULL) return NULL;
    Node* op2 = (node = node->getNext());
    if (node == NULL) return NULL;

    Instruction::BinaryOps instr;
    if (opt == "+") { instr = Instruction::Add;  goto binOper; }
    if (opt == "-") { instr = Instruction::Sub;  goto binOper; }
    if (opt == "*") { instr = Instruction::Mul;  goto binOper; }
    if (opt == "/") { instr = Instruction::SDiv; goto binOper; }

    // unknown operator
    return NULL;

binOper:
    return BinaryOperator::Create(instr, op1->codeGen(context),
        op2->codeGen(context), "", context->getNowBlock());
}
~~~

### Appendix: documentation and source code

[CallInst class reference](http://llvm.org/doxygen/classllvm_1_1CallInst.html)
[BinaryOperator class reference](http://llvm.org/doxygen/classllvm_1_1BinaryOperator.html)
[github source: function calls and basic operators](https://github.com/sunxfancy/RedApple/blob/master/src/Macro/Functions.cpp)
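Two ideas from this article, mapping an operator string to an opcode and generating operands recursively before the operation itself, can be exercised without LLVM. The sketch below evaluates numbers instead of emitting instructions. `BinOp`, `lookup_op`, `Expr`, `leaf`, `node` and `eval` are hypothetical names, and the lookup table is one possible alternative to the if/goto chain, not the author's code.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <utility>

enum class BinOp { Add, Sub, Mul, SDiv };

// Table lookup in place of an if/goto chain; returns false for unknown operators.
bool lookup_op(const std::string& text, BinOp& out) {
    static const std::map<std::string, BinOp> table = {
        {"+", BinOp::Add}, {"-", BinOp::Sub}, {"*", BinOp::Mul}, {"/", BinOp::SDiv}};
    auto it = table.find(text);
    if (it == table.end()) return false;
    out = it->second;
    return true;
}

// Expression node: a leaf constant, or an operator with two operands.
struct Expr {
    std::string op;                 // empty for a leaf
    long value = 0;                 // used by leaves only
    std::unique_ptr<Expr> lhs, rhs;
};

std::unique_ptr<Expr> leaf(long v) {
    std::unique_ptr<Expr> e(new Expr);
    e->value = v;
    return e;
}

std::unique_ptr<Expr> node(const std::string& op,
                           std::unique_ptr<Expr> l, std::unique_ptr<Expr> r) {
    std::unique_ptr<Expr> e(new Expr);
    e->op = op;
    e->lhs = std::move(l);
    e->rhs = std::move(r);
    return e;
}

// Operands are evaluated recursively before the operator is applied,
// in the same way that call arguments are generated before the call.
long eval(const Expr& e) {
    if (e.op.empty()) return e.value;
    BinOp op = BinOp::Add;
    bool known = lookup_op(e.op, op);
    assert(known && "unknown operator");
    (void)known;
    long a = eval(*e.lhs), b = eval(*e.rhs);
    switch (op) {
        case BinOp::Add:  return a + b;
        case BinOp::Sub:  return a - b;
        case BinOp::Mul:  return a * b;
        case BinOp::SDiv: return a / b;   // signed division, truncating toward zero
    }
    return 0;
}
```

Adding a new operator then means adding one table row and one `case`, which scales better than another `if`/`goto` pair once comparison and floating-point operators come into play.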


(7) Translating Functions

Last updated: 2022-04-01 14:36:12

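# Translating Functions

The previous articles covered a number of architectural aspects of the compiler: how to organize the syntax tree and how to scan it in multiple passes. Starting today we design the core of the compiler: how to design a compile-time macro and then use LLVM to generate the module piece by piece.

### Designing the macro

Our compiler can be called macro-driven, because after scanning each syntax node we check whether it names a valid macro. Take the sample code from the previous chapter:

~~~
void hello(int k, int g) {
    ......
}
~~~

I have hidden the function body for now so that we can focus on the function header:

~~~
String function
String void
String hello
Node
  Node
    String set
    String int
    String k
  Node
    String set
    String int
    String g
Node
  ......
~~~

Each level of our syntax tree is organized as a linked list, and the next pointer leads to the following element. The tree begins with the macro name "function", which tells us which macro function to use for the translation. The nodes that follow are the return type, the function name, the parameter list and the function body. The parameter list, for example, has a lot of content, but during scanning it is recognized as a single unit. So the macro has this form:

~~~
(function return-type function-name (parameter-list) (function-body))
~~~

A part in parentheses is a list rather than a single element.

### Writing the macro function

We have already defined the form of a macro function. It takes our own context class and the Node currently being processed, and returns an LLVM Value (the abstract base class of all instructions):

~~~
typedef Value* (*CodeGenFunction)(CodeGenContext*, Node*);
~~~

Here is a skeleton of the function (F is filled in further below):

~~~
static Value* function_macro(CodeGenContext* context, Node* node) {
    // Argument 1: return type

    // Argument 2: function name
    node = node->getNext();

    // Argument 3: parameter list
    Node* args_node = node = node->getNext();

    // Argument 4: code block
    node = node->getNext();

    return F;
}
~~~

Resolving the type that a string stands for is often not easy, especially when structs and classes exist. We usually have to consult the symbol table to check whether the string names a type and what kind of type it is: a basic type, a struct, a function pointer, a pointer to some other structure, and so on. Obtaining types is a very important step when working with LLVM. For now we only write the interface for the symbol table lookup without implementing it; a later chapter presents a classic stack-based symbol table.

The second argument is the function name, which we keep in a temporary variable for later:

~~~
static Value* function_type_macro(CodeGenContext* context, Node* node) {
    // Argument 1: return type
    Type* ret_type = context->FindType(node);

    // Argument 2: function name
    node = node->getNext();
    std::string function_name = node->getStr();

    // Argument 3: parameter list
    Node* args_node = node = node->getNext();

    // Argument 4: code block
    node = node->getNext();

    return F;
}
~~~

The parameter list is perhaps the hardest part to implement because of its nesting, but the idea is straightforward: keep scanning nodes. That gives the following code:

~~~
// Argument 3: parameter list
Node* args_node = node = node->getNext();
std::vector<Type*> type_vec;          // list of types
std::vector<std::string> arg_name;    // list of parameter names
if (args_node->getChild() != NULL) {
    for (Node* pC = args_node->getChild(); pC != NULL; pC = pC->getNext() ) {
        Node* pSec = pC->getChild()->getNext();
        Type* t = context->FindType(pSec);
        type_vec.push_back(t);
        arg_name.push_back(pSec->getNext()->getStr());
    }
}
~~~

With the first three arguments we can already build an LLVM function declaration; no body is needed for that.

Many LLVM objects share this property. A function can be declared with only its header and have its body filled in after the body has been parsed. Structs work the same way: declare the symbol first and fill in the type information later. This makes declarations easy to generate and is very flexible in a multi-pass implementation. Now let's declare the function:

~~~
// First assemble the function
FunctionType *FT = FunctionType::get(ret_type, type_vec, /*not vararg*/false);
Module* M = context->getModule();
Function *F = Function::Create(FT, Function::ExternalLinkage, function_name, M);
~~~

Here we use a function type, which is one of the classes derived from Type. A function type also supports getPointerTo to obtain the function pointer type. Passing Function::ExternalLinkage when creating the function is roughly like C's extern: it makes the function's symbol externally visible. The function you write can then be linked from outside, or serve as the declaration of an external function.

### Special issues with functions

Next we create the function's code block. This part is not actually implemented in the same function as the code above; more precisely, it does not run in the same pass. If a block inside one function is to call functions declared anywhere in the source, we must first process the three arguments discussed above for every function so that all declarations exist. Only then, in the main pass, do we generate the code block:

~~~
// Argument 4: code block
node = node->getNext();
BasicBlock* bb = context->createBlock(F); // create a new block

// Special handling of the parameter list. This is a real trap: for every
// function parameter you must allocate space manually with AllocaInst and
// store the argument with StoreInst, otherwise a later load fails.
// context->MacroMake(args_node->getChild());
if (args_node->getChild() != NULL) {
    context->MacroMake(args_node);
    int i = 0;
    for (auto arg = F->arg_begin(); i != arg_name.size(); ++arg, ++i) {
        arg->setName(arg_name[i]);
        Value* argumentValue = arg;
        ValueSymbolTable* st = bb->getValueSymbolTable();
        Value* v = st->lookup(arg_name[i]);
        new StoreInst(argumentValue, v, false, bb);
    }
}
context->MacroMake(node);

// Finish the end of the block
bb = context->getNowBlock();
if (bb->getTerminator() == NULL)
    ReturnInst::Create(*(context->getContext()), bb);
return F;
~~~

There are many issues in this part. I will leave them as a cliffhanger and come back to this special handling when we discuss code blocks and the storing and loading of variables.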
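The reason for declaring every function before generating any body is easy to demonstrate outside LLVM. In the sketch below, `FuncDef`, `ToyModule` and `compile` are hypothetical names: pass 1 records only headers, and pass 2 checks that every callee is already known, so the order of definitions in the source no longer matters.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// A source-level function: its name, its parameters, and the callees in its body.
struct FuncDef {
    std::string name;
    std::vector<std::string> params;
    std::vector<std::string> calls;
};

class ToyModule {
public:
    // Pass 1: record only the header (name and arity), like a declaration without a body.
    void declare(const FuncDef& f) { arity_[f.name] = f.params.size(); }

    // Pass 2: "generate" the body; every callee must already be declared.
    bool define(const FuncDef& f) {
        for (const std::string& callee : f.calls)
            if (arity_.find(callee) == arity_.end()) return false;
        defined_[f.name] = true;
        return true;
    }
    bool isDefined(const std::string& n) const { return defined_.count(n) != 0; }
private:
    std::map<std::string, std::size_t> arity_;
    std::map<std::string, bool> defined_;
};

// Run both passes over the whole program; returns false on an unresolved call.
bool compile(ToyModule& m, const std::vector<FuncDef>& program) {
    for (const FuncDef& f : program) m.declare(f);
    for (const FuncDef& f : program)
        if (!m.define(f)) return false;
    return true;
}
```

If `main` calls `hello` and `hello` is defined later in the source, `compile` still succeeds, whereas calling `define` for `main` on its own fails because `hello` has not been declared yet.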


(6) A Multi-Pass Macro Translation System

Last updated: 2022-04-01 14:36:09

# A Multi-Pass Macro Translation System

Last time we discussed the basic model for building a syntax tree. With Lex, Bison and Node we can translate our target syntax into an AST, and in chapter 4 we presented the syntax of RedApple, our small experimental compiler. The preparation is basically complete.

Now that the AST can be built, we need a mechanism that walks the whole tree, translates it into an LLVM module, and writes that out as .bc bytecode.

I call this mechanism a multi-pass macro translation system, because it scans the whole tree several times, handles only the parts it needs in each pass, and so builds up the module.

I would like to support Java-like behavior where the order of definitions does not matter: if a symbol is defined anywhere, it can be found. For that we need a sensible scanning order.

### Deciding the scanning order

First we must find all types, because type declarations matter most: without them no function can be built.

Second we find all functions and build a declaration for each.

Finally we scan all function definitions and build each function body.

That makes three passes. Efficiency is not a concern, because the first two passes only look at the level directly below the root, where there are few elements, so they run quickly.

### The AST to be scanned

This is the AST we generated earlier; the structure is reasonably clear. Our only means of traversal is the next pointer implemented last time: we keep examining the data of the current node and generate the corresponding code.

To tell the statements apart, I put the name of the translation macro at the front of every list. This design is modeled on lisp. A macro is like a function inside the compiler: it processes the raw data and translates it into the corresponding output.

Take this code, for example:

~~~
void hello(int k, int g) {
    int y = k + g;
    printf("%d\n", y);
    if (k + g < 5) printf("right\n");
}

void go(int k) {
    int a = 0;
    while (a < k) {
        printf("go-%d\n", a);
        a = a + 1;
    }
}

void print(int k) {
    for (int i = 0; i < 10; i = i+1) {
        printf("hello-%d\n",i);
    }
}

void main() {
    printf("hello world\n");
    hello(1,2);
    print(9);
}
~~~

Its AST is:

~~~
Node
  Node
    String function
    String void
    String hello
    Node
      Node
        String set
        String int
        String k
      Node
        String set
        String int
        String g
    Node
      Node
        String set
        String int
        String y
        Node
          String opt2
          String +
          ID k
          ID g
      Node
        String call
        String printf
        String %d
        ID y
      Node
        String if
        Node
          String opt2
          String <
          Node
            String opt2
            String +
            ID k
            ID g
          Int 5
        Node
          String call
          String printf
          String right
  Node
    String function
    String void
    String go
    Node
      Node
        String set
        String int
        String k
    Node
      Node
        String set
        String int
        String a
        Int 0
      Node
        String while
        Node
          String opt2
          String <
          ID a
          ID k
        Node
          Node
            String call
            String printf
            String go-%d
            ID a
          Node
            String opt2
            String =
            ID a
            Node
              String opt2
              String +
              ID a
              Int 1
  Node
    String function
    String void
    String print
    Node
      Node
        String set
        String int
        String k
    Node
      Node
        String for
        Node
          String set
          String int
          String i
          Int 0
        Node
          String opt2
          String <
          ID i
          Int 10
        Node
          String opt2
          String =
          ID i
          Node
            String opt2
            String +
            ID i
            Int 1
        Node
          Node
            String call
            String printf
            String hello-%d
            ID i
  Node
    String function
    String void
    String main
    Node
    Node
      Node
        String call
        String printf
        String hello world
      Node
        String call
        String hello
        Int 1
        Int 2
      Node
        String call
        String print
        Int 9
~~~

### The context during scanning

Translation needs the LLVMContext, the symbol table, the macro definition table and other essential information, so we implement a context class of our own to hold it. The context must be initialized before the first pass.

For example, when translation meets a variable: is it local or global? What is its type? All of this must be recorded in the symbol table. And which macro does the current statement belong to, and how should it be translated? We need a class to keep this information.

Leaving the implementation aside, here is the interface:

~~~
class CodeGenContext;
typedef Value* (*CodeGenFunction)(CodeGenContext*, Node*);
typedef struct _funcReg
{
    const char*     name;
    CodeGenFunction func;
} FuncReg;

class CodeGenContext
{
public:
    CodeGenContext(Node* node);
    ~CodeGenContext();

    // Required initialization methods
    void PreInit();
    void PreTypeInit();
    void Init();

    void MakeBegin() {
        MacroMake(root);
    }

    // Translates Node macros one by one
    Value* MacroMake(Node* node);
    // Recursively translates all macros under this node
    void MacroMakeAll(Node* node);
    CodeGenFunction getMacro(string& str);

    // Registering macros from C++
    // void AddMacros(const FuncReg* macro_funcs); // reserved for add-without-replace
    void AddOrReplaceMacros(const FuncReg* macro_funcs);

    // Operations on the stack of code blocks
    BasicBlock* getNowBlock();
    BasicBlock* createBlock();
    BasicBlock* createBlock(Function* f);

    // Get a function registered in the current module
    Function* getFunction(Node* node);
    Function* getFunction(std::string& name);
    void nowFunction(Function* _nowFunc);

    void setModule(Module* pM) { M = pM; }
    Module* getModule() { return M; }
    void setContext(LLVMContext* pC) { Context = pC; }
    LLVMContext* getContext() { return Context; }

    // Defining and looking up types
    void DefType(string name, Type* t);
    Type* FindType(string& name);
    Type* FindType(Node*);

    void SaveMacros();
    void RecoverMacros();

    bool isSave() { return _save; }
    void setIsSave(bool save) { _save = save; }

    id* FindST(Node* node) const;

    id* FindST(string& str) const {
        return st->find(str);
    }
    IDTable* st;
private:
    // Root node of the syntax tree
    Node* root;

    // The current LLVM Module
    Module* M;
    LLVMContext* Context;
    Function* nowFunc;
    BasicBlock* nowBlock;

    // Used to look up whether a macro is defined
    map<string, CodeGenFunction> macro_map;

    // This stack temporarily saves the lookup table above
    stack<map<string, CodeGenFunction> > macro_save_stack;

    void setNormalType();

    // Records whether we are currently loading or storing
    bool _save;
};
~~~

### Registering macros

Macros are very important internal functions. A macro is a C function pointer with a unique name. The map is used to look up the function for a macro name, which is then called to process the current syntax node.

The macro function type:

~~~
typedef Value* (*CodeGenFunction)(CodeGenContext*, Node*);
~~~

Registration is modeled on lua: the function pointers are put into an array of structs:

~~~
extern const FuncReg macro_funcs[] = {
    {"function", function_macro},
    {"struct",   struct_macro},
    {"set",      set_macro},
    {"call",     call_macro},
    {"opt2",     opt2_macro},
    {"for",      for_macro},
    {"while",    while_macro},
    {"if",       if_macro},
    {"return",   return_macro},
    {"new",      new_macro},
    {NULL, NULL}
};
~~~

This makes it easy to import a whole batch of functions into the system at once. I still prefer plain C function pointers and generally avoid C++ member pointers. They are more complicated and harder to link with other modules, because C++ name mangling and object layout vary between compilers, while each platform's C ABI is stable in practice.

### Driving the scan

The scan itself is simple. If the current node is a string that names a defined macro, we call that macro to handle it. Otherwise, if it is a list, we process each element recursively.

For macro lookup I simply use map and string from the STL, which is very convenient.

~~~
Value* CodeGenContext::MacroMake(Node* node) {
    if (node == NULL) return NULL;

    if (node->isStringNode()) {
        StringNode* str_node = (StringNode*)node;
        CodeGenFunction func = getMacro(str_node->getStr());
        if (func != NULL) {
            return func(this, node->getNext());
        }
        return NULL;
    }
    if (node->getChild() != NULL && node->getChild()->isStringNode())
        return MacroMake(node->getChild());
    Value* ans = NULL;   // stays NULL when the list has no children
    for (Node* p = node->getChild(); p != NULL; p = p->getNext())
        ans = MacroMake(p);
    return ans;
}

CodeGenFunction CodeGenContext::getMacro(string& str) {
    auto func = macro_map.find(str);
    if (func != macro_map.end()) return func->second;
    else return NULL;
}
~~~

This is enough to drive macro translation. How are the multiple passes implemented? Quite simply: use the registration function to replace the current macros and run the driver again. That is all a new pass is.
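The closing remark of this article (a new pass is just the same driver run again after the macros have been replaced) can be shown with a small stand-alone sketch. It is an illustration only: `ToyContext`, `MacroReg`, `MacroFunction`, the two handlers and the two tables are hypothetical, and a "node" is reduced to a flat list of strings.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

class ToyContext;
typedef int (*MacroFunction)(ToyContext*, const std::vector<std::string>&);

struct MacroReg { const char* name; MacroFunction func; };

class ToyContext {
public:
    // Register a NULL-terminated batch; an existing name is overwritten.
    void AddOrReplaceMacros(const MacroReg* regs) {
        for (; regs->name != nullptr; ++regs) table_[regs->name] = regs->func;
    }
    // Dispatch on the first element of the list; unknown names yield 0.
    int MacroMake(const std::vector<std::string>& list) {
        if (list.empty()) return 0;
        auto it = table_.find(list[0]);
        return it == table_.end() ? 0 : it->second(this, list);
    }
    std::vector<std::string> log;   // records what each pass did
private:
    std::map<std::string, MacroFunction> table_;
};

// Pass 1 handler: only note the declaration.
static int function_declare(ToyContext* c, const std::vector<std::string>& l) {
    c->log.push_back("declare " + l.at(2));
    return 1;
}

// Pass 2 handler: "generate" the body.
static int function_define(ToyContext* c, const std::vector<std::string>& l) {
    c->log.push_back("define " + l.at(2));
    return 2;
}

static const MacroReg pass1_macros[] = {{"function", function_declare}, {nullptr, nullptr}};
static const MacroReg pass2_macros[] = {{"function", function_define}, {nullptr, nullptr}};
```

Registering `pass1_macros`, running the driver, then registering `pass2_macros` and running it again processes the same `function` list twice with different behavior, which is the whole mechanism behind the three passes.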


(5) The Basic Structure of the Syntax Tree Model

Last updated: 2022-04-01 14:36:07

# The Basic Structure of the Syntax Tree Model

Last time we looked at the Lex and Yacc grammar files. Some readers may not have understood the action code in them, or how the abstract syntax tree actually gets built. Today we look in more detail at how to build an abstract syntax tree (AST) conveniently.

### Linking Node objects as a left-child, right-sibling binary tree

An AST is a multiway tree, which is awkward to represent directly. So we use a classic transformation from data structures: convert the multiway tree into a left-child, right-sibling binary tree.

![Figure](https://docs.gechiui.com/gc-content/uploads/sites/kancloud/2016-06-03_5750ee1b7de35.jpg "")

The idea is simple: each level is a linked list that strings the sibling nodes together.

~~~
class Node
{
public:
    Node();
    Node(Node* n);
    ~Node();

    // List construction
    void addChildren(Node* n);
    void addBrother (Node* n);
    static Node* make_list(int num, ...);
    static Node* getList(Node* node);

    Node* getNext() { return next; }
    Node* getChild() { return child; }
protected:
    Node* next;
    Node* child;
};
~~~

So we build a Node class, the same node class that appeared in the grammar script last time. Simple, isn't it?

We also write a make_list function to make building a list convenient. How to write it is discussed below.

### Type support

So far the syntax tree cannot hold any data. The point of an AST is to store data in each node: strings, characters, integers, floating-point numbers, identifiers and so on.

More importantly, the tree should make it easy to construct LLVM instructions. A convenient design is to use polymorphism. It is not as lean in speed or memory as a union, but it is convenient.

So we create a set of classes derived from Node, and Node itself gains some functions for querying the type of the current node.

Node.h

~~~
enum NodeType // type enumeration
{
    node_t = 0, int_node_t, float_node_t, char_node_t, id_node_t, string_node_t
};

class CodeGenContext;
class Node
{
public:
    Node();
    Node(Node* n); // adds n directly as a child of this node
    ~Node();

    // List construction
    void addChildren(Node* n);
    void addBrother (Node* n);
    bool isSingle();
    static Node* make_list(int num, ...);
    static Node* getList(Node* node);

    void print(int k); // print the current node
    Node* getNext() { return next; }
    Node* getChild() { return child; }
    virtual Value* codeGen(CodeGenContext* context); // LLVM code generation

    // Get or set the LLVM type of this node; returns NULL for an unknown type
    virtual Type* getLLVMType();
    virtual void  setLLVMType(Type* t);

    // If the node holds a string, return it; otherwise report an error
    std::string& getStr();

    // Type-related
    std::string getTypeName();
    virtual NodeType getType();
    bool isNode();
    bool isIntNode();
    bool isFloatNode();
    bool isIDNode();
    bool isStringNode();
    bool isCharNode();

protected:
    virtual void printSelf(); // print this node's own name
    void init();

    Type* llvm_type;
    Node* next;
    Node* child;
};
~~~

IDNode.h is our identifier class. It derives from Node, and the other node types follow the same pattern, so I will not list them all. For the full code see the [source on github](https://github.com/sunxfancy/RedApple)

~~~
#include "Node.h"
#include <string>

using namespace std;

class IDNode: public Node {
public:
    IDNode(const char* _value){
        this->value = _value;
    }
    IDNode(char _value){
        this->value = _value;
    }
    std::string& getStr() { return value; }
    virtual Value* codeGen(CodeGenContext* context);
    virtual NodeType getType();
protected:
    virtual void printSelf();
private:
    string value;
};
~~~

### A problem in AST construction

There is one particular problem when building the tree, caused by a design choice that was not ideal: I did not create a separate List type to hold child elements but packed that role into Node itself. As a result it is hard to tell whether the node about to be inserted is a single element or a list of elements.

So I wrote an isSingle function that decides whether the current element stands alone, simply by checking whether its next pointer is NULL.

A single element can be appended directly to the end of the current sequence when building a list. Otherwise a new Node is created and its child pointer is set to the element being inserted.

That is how make_list and getList came to be written this way:

~~~
Node* Node::make_list(int num, ...) {
    va_list argp; Node* para = NULL;
    Node* ans = NULL;
    va_start( argp, num );
    for (int i = 0; i < num; ++i) {
        para = va_arg( argp, Node* );
        if (!para->isSingle()) para = new Node(para);
        if ( ans == NULL )
            ans = para;
        else ans->addBrother(para);
    }
    va_end( argp );
    return ans;
}

Node* Node::getList(Node* node) {
    if (!node->isSingle()) return new Node(node);
    return node;
}
~~~

### Generating basic LLVM instructions

We built all these classes in order to generate LLVM instructions, so let's generate a few simple ones.

First, the LLVM type system. Every LLVM instruction carries a type, and an instruction can be treated as a Value pointer, so the type is obtained like this:

~~~
    Type* t = value->getType();
~~~

Type is easy to work with. For example, to get the pointer type to it:

~~~
    PointerType* ptr_type = t->getPointerTo();
~~~

Type also has many static functions that produce the basic types:

~~~
// Basic types
static Type *   getVoidTy (LLVMContext &C)
static Type *   getFloatTy (LLVMContext &C)
static Type *   getDoubleTy (LLVMContext &C)
static Type *   getMetadataTy (LLVMContext &C)

// Integer types of various widths
static IntegerType *    getInt8Ty (LLVMContext &C)
static IntegerType *    getInt16Ty (LLVMContext &C)
static IntegerType *    getInt32Ty (LLVMContext &C)
static IntegerType *    getInt64Ty (LLVMContext &C)

// Pointer types to various types
static PointerType *    getFloatPtrTy (LLVMContext &C, unsigned AS=0)
static PointerType *    getDoublePtrTy (LLVMContext &C, unsigned AS=0)
static PointerType *    getInt8PtrTy (LLVMContext &C, unsigned AS=0)
static PointerType *    getInt16PtrTy (LLVMContext &C, unsigned AS=0)
static PointerType *    getInt32PtrTy (LLVMContext &C, unsigned AS=0)
static PointerType *    getInt64PtrTy (LLVMContext &C, unsigned AS=0)
~~~

The basic node types in our AST are all constants in the syntax (except IDNode), so they should all generate constants.

The base class of constants is Constant, and the commonly used ones are ConstantInt, ConstantFP and ConstantExpr.

Here is the LLVM code for integers, floating-point numbers and global strings:

~~~
Value* IntNode::codeGen(CodeGenContext* context) {
    Type* t = Type::getInt64Ty(*(context->getContext()));
    setLLVMType(t);
    return ConstantInt::get(t, value);
}

Value* FloatNode::codeGen(CodeGenContext* context) {
    Type* t = Type::getFloatTy(*(context->getContext()));
    setLLVMType(t);
    return ConstantFP::get(t, value);
}

Value* StringNode::codeGen(CodeGenContext* context) {
    Module* M = context->getModule();
    LLVMContext& ctx = M->getContext(); // never use the global context here
    Constant* strConstant = ConstantDataArray::getString(ctx, value);
    Type* t = strConstant->getType();
    setLLVMType(t);
    GlobalVariable* GVStr = new GlobalVariable(*M, t, true,
                            GlobalValue::InternalLinkage, strConstant, "");
    Constant* zero = Constant::getNullValue(IntegerType::getInt32Ty(ctx));
    Constant* indices[] = {zero, zero};
    Constant* strVal = ConstantExpr::getGetElementPtr(GVStr, indices, true);
    return strVal;
}
~~~

The constant string is the most complicated case. The string constant itself is created with ConstantDataArray::getString, but functions usually do not accept a value of string (array) type. As in C, you have to pass the address of its first element. Remember the definition of printf we wrote earlier? Its first parameter is a char* pointer.

So we use ConstantExpr::getGetElementPtr to take that address. indices is an array. The first index treats the pointer as an array and selects which element to address. The second index selects an element inside the object pointed to, here the first character of the char array (for a struct it would select a member). Passing the constant 0 for both is all we need.

One more thing to note: the constant 0 used as an index here apparently cannot be of int64 type. I had not noticed this before and kept running into inexplicable problems with int64. For reference, the LLVM documentation requires struct indices to be i32 and allows array indices to be i64, so the cause of those problems may have been something else. Using i32 for both works, and an array indexed by int32 is long enough in practice.

### Appendix: the complete implementation of the Node class

~~~
/*
* @Author: sxf
* @Date:   2015-09-22 19:21:40
* @Last Modified by:   sxf
* @Last Modified time: 2015-11-01 21:05:14
*/

#include "Node.h"
#include <stdarg.h>
#include <stdio.h>
#include "nodes.h"
#include <iostream>

void Node::init() {
    llvm_type = NULL;
    next = child = NULL;
}

Node::Node() {
    init();
}

Node::Node(Node* n) {
    init();
    addChildren(n);
}

Node::~Node() {

}

void Node::addChildren(Node* n) {
    if (child == NULL) {
        child = n;
    } else {
        child->addBrother(n);
    }
}

void Node::addBrother (Node* n) {
    Node* p = this;
    while (p->next != NULL) {
        p = p->next;
    }
    p->next = n;
}

void Node::print(int k) {
    for (int i = 0; i < k; ++i)
        printf(" ");
    printSelf();
    printf("\n");

    Node* p = child; int t = 0;
    while (p != NULL) {
        p->print(k+1);
        p = p->next;
        ++t;
    }
    if (t >= 3) printf("\n");
}

void Node::printSelf() {
    printf("Node");
}

NodeType Node::getType() {
    return node_t;
}

bool Node::isSingle() {
    return next == NULL;
}

Node* Node::make_list(int num, ...) {
    va_list argp; Node* para = NULL;
    Node* ans = NULL;
    va_start( argp, num );
    for (int i = 0; i < num; ++i) {
        para = va_arg( argp, Node* );
        if (!para->isSingle()) para = new Node(para);
        if ( ans == NULL )
            ans = para;
        else ans->addBrother(para);
    }
    va_end( argp );
    return ans;
}

Node* Node::getList(Node* node) {
    if (!node->isSingle()) return new Node(node);
    return node;
}

Type* Node::getLLVMType() {
    return llvm_type;
}

void Node::setLLVMType(Type* t) {
    llvm_type = t;
}

bool Node::isNode() {
    return getType() == node_t;
}

bool Node::isIntNode() {
    return getType() == int_node_t;
}

bool Node::isFloatNode() {
    return getType() == float_node_t;
}

bool Node::isIDNode() {
    return getType() == id_node_t;
}

bool Node::isStringNode() {
    return getType() == string_node_t;
}

bool Node::isCharNode() {
    return getType() == char_node_t;
}

std::string Node::getTypeName() {
    switch (getType()) {
        case node_t: return "Node";
        case int_node_t: return "IntNode";
        case string_node_t: return "StringNode";
        case id_node_t: return "IDNode";
        case char_node_t: return "CharNode";
        case float_node_t: return "FloatNode";
    }
    return "Unknown"; // not reached for valid NodeType values
}

std::string& Node::getStr() {
    if (this->isStringNode()) {
        StringNode* string_this = (StringNode*)this;
        return string_this->getStr();
    }
    if (this->isIDNode()) {
        IDNode* string_this = (IDNode*)this;
        return
string_this->getStr(); } std::cerr << "getStr() - 获取字符串错误, 该类型不正确：" << getTypeName() << std::endl; exit(1); } ~~~
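The left-child, right-sibling layout described in this chapter can be tried out without LLVM. The following is only a minimal sketch under stated assumptions: `MiniNode`, `addChild` and `childLabels` are hypothetical names invented for illustration and are not part of RedApple. It shows how a parent with three children is stored using just a `child` pointer and a chain of `next` pointers.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the article's Node class: it keeps only a label
// plus the two links of a left-child, right-sibling tree.
struct MiniNode {
    std::string label;
    MiniNode* next = nullptr;   // right sibling
    MiniNode* child = nullptr;  // first (leftmost) child
    explicit MiniNode(std::string l) : label(std::move(l)) {}
};

// Append n to the end of the sibling chain that starts at head.
void addBrother(MiniNode* head, MiniNode* n) {
    while (head->next != nullptr) head = head->next;
    head->next = n;
}

// Append n as the last child of parent.
void addChild(MiniNode* parent, MiniNode* n) {
    if (parent->child == nullptr) parent->child = n;
    else addBrother(parent->child, n);
}

// Collect the labels of parent's direct children, left to right.
std::vector<std::string> childLabels(const MiniNode* parent) {
    std::vector<std::string> out;
    for (const MiniNode* p = parent->child; p != nullptr; p = p->next)
        out.push_back(p->label);
    return out;
}
```

Adding three children to a root leaves `root.child` pointing at the first one, and the other two are reachable only through `next`. A node with any number of children therefore needs just two pointers.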


(4) Simple Lexical and Syntax Analysis

Last updated: 2022-04-01 14:36:05

# Simple Lexical and Syntax Analysis

Lex and Yacc are a real pleasure to use and make it easy to build the analysis front end for a language. If you are not familiar with them, I suggest first reading two earlier articles of mine that introduce Lex and Yacc.

Recognizing C-style strings and comments with Lex
[http://blog.csdn.net/xfxyy_sxfancy/article/details/45024573](http://blog.csdn.net/xfxyy_sxfancy/article/details/45024573)

Creating a New Language (2): building a simple analyzer with Lex & Yacc
[http://blog.csdn.net/xfxyy_sxfancy/article/details/45046465](http://blog.csdn.net/xfxyy_sxfancy/article/details/45046465)

### Building a language's lexer with Flex

We are creating a real programming language, so the lexer cannot be as casual as a lab exercise; it has to be thorough. A lexer for a language generally has to cover the following:

- keywords
- identifiers
- basic constants (integers, floating-point numbers, strings, characters)
- comments
- operators

All of these are essential parts of a language's syntax. This is how I designed the lexer for RedApple:

~~~
%{
#include <string>
#include "Model/nodes.h"
#include <list>
using namespace std;

#include "redapple_parser.hpp"
#include "StringEscape.h"

#define SAVE_TOKEN     yylval.str = maketoken(yytext, yyleng)
#define SAVE_STRING    yylval.str = makestring(yytext, yyleng, 2)
#define SAVE_STRING_NC yylval.str = makestring(yytext, yyleng, 3)

extern "C" int yywrap() { return 1; }
char* maketoken(const char* data, int len);
char* makestring(const char* data, int len, int s);
%}

%option yylineno

%%

"/*"([^\*]|(\*)*[^\*/])*(\*)*"*/"  ; /* block comments like this one */
#[^\n]*\n               ; /* hash comments */
"//"[^\n]*\n            ; /* double-slash comments */
[ \t\v\n\f]             ; /* skip whitespace */

"=="                    return CEQ;
"<="                    return CLE;
">="                    return CGE;
"!="                    return CNE;

"<"                     return '<';
"="                     return '=';
">"                     return '>';
"("                     return '(';
")"                     return ')';
"["                     return '[';
"]"                     return ']';
"{"                     return '{';
"}"                     return '}';
"."                     return '.';
","                     return ',';
":"                     return ':';
";"                     return ';';
"+"                     return '+';
"-"                     return '-';
"*"                     return '*';
"/"                     return '/';
"%"                     return '%';
"^"                     return '^';
"&"                     return '&';
"|"                     return '|';
"~"                     return '~';

    /* macro operators */
"@"                     return '@';
",@"                    return MBK;

    /* keywords */

    /* control flow */
"if"                    return IF;
"else"                  return ELSE;
"while"                 return WHILE;
"do"                    return DO;
"goto"                  return GOTO;
"for"                   return FOR;
"foreach"               return FOREACH;

    /* exit control */
"break"|"continue"|"exit"   SAVE_TOKEN; return KWS_EXIT;

"return"                return RETURN;

    /* special operators */
"new"                   return NEW;
"this"                  return THIS;

    /* special definitions */
"delegate"              return DELEGATE;
"def"                   return DEF;
"define"                return DEFINE;
"import"                return IMPORT;
"using"                 return USING;
"namespace"             return NAMESPACE;

"try"|"catch"|"finally"|"throw"  SAVE_TOKEN; return KWS_ERROR; /* exception handling */

"null"|"true"|"false"   SAVE_TOKEN; return KWS_TSZ; /* special values */

"struct"|"enum"|"union"|"module"|"interface"|"class"  SAVE_TOKEN; return KWS_STRUCT; /* structure declarations */

"public"|"private"|"protected"  SAVE_TOKEN; return KWS_FWKZ; /* access control */

"const"|"static"|"extern"|"virtual"|"abstract"|"in"|"out"  SAVE_TOKEN; return KWS_FUNC_XS; /* function modifiers */

"void"|"double"|"int"|"float"|"char"|"bool"|"var"|"auto"  SAVE_TOKEN; return KWS_TYPE; /* basic types */

[a-zA-Z_][a-zA-Z0-9_]*  SAVE_TOKEN; return ID; /* identifiers */

[0-9]*\.[0-9]*          SAVE_TOKEN; return DOUBLE;
[0-9]+                  SAVE_TOKEN; return INTEGER;

\"(\\.|[^\\"])*\"       SAVE_STRING; return STRING; /* strings */
@\"(\\.|[^\\"])*\"      SAVE_STRING_NC; return STRING; /* strings without escape processing */
\'(\\.|.)\'             SAVE_STRING; return CHAR; /* characters */

.                       printf("Unknown Token!\n"); yyterminate();

%%

/* Copy the matched text into a new NUL-terminated buffer */
char* maketoken(const char* data, int len) {
    char* str = new char[len+1];
    strncpy(str, data, len);
    str[len] = 0;
    return str;
}

/* Strip the surrounding quotes (s == 2) or the leading @ plus quotes (s == 3);
   escape sequences are processed only in the s == 2 case */
char* makestring(const char* data, int len, int s) {
    char* str = new char[len-s+1];
    strncpy(str, data+s-1, len-s);
    str[len-s] = 0;
    if (s == 3) return str;
    printf("source: %s\n",str);
    char* ans = CharEscape(str);
    printf("escape: %s\n",ans);
    delete[] str;
    return ans;
}
~~~

It looks long, but most of it is just the enumeration of keywords and operators. In the early stages of a language you do not need to cover everything; write the ones you use first and fill in the rest later.

The hardest rule here is probably this one:

~~~
"/*"([^\*]|(\*)*[^\*/])*(\*)*"*/"  ; /* block comments like this one */
~~~

At first glance the regular expression looks frightening, but it merely enumerates the possible cases so that the comment ends in the right place. Spread out for reading (this layout is not valid Flex input):

~~~
"/*"
(
    [^\*]
  |
    (\*)* [^\*/]
)*
(\*)*
"*/"
~~~

The body is any character other than `*`, or a run of stars followed by a character that is neither `*` nor `/`, so the sequence `*/` can never appear inside the body and the match stops at the first terminator.

### Building a general parser with Bison

The grammar I wrote is C-like. Be aware that many constructs cause reduce-reduce and shift-reduce conflicts, so here is a brief introduction to how Bison works.

The algorithm is known in compiler theory as LALR(1) parsing. It is one of the bottom-up, reduction-based algorithms, and it looks ahead by one token. Each rule in a Bison file is called a production (written in BNF style), for example:

~~~
def_module_statement : KWS_STRUCT ID '{' def_statements '}'
~~~

The left side is the nonterminal to reduce to; the right side of the colon describes which symbols produce it. This rule describes the syntax of a structure definition. KWS_STRUCT is a terminal that comes from the lexer; in the Lex file above you can find its definition:

~~~
"struct"|"enum"|"union"|"module"|"interface"|"class"  SAVE_TOKEN; return KWS_STRUCT; /* structure declarations */
~~~

It is simply one of several possible keywords. def_statements is another nonterminal, defined by other rules.

A reduce-reduce conflict means that once the right-hand side of a production has been read, and even with the next token known, the parser could reduce to two different nonterminals:

~~~
def_module_statement : KWS_STRUCT ID '{' def_statements '}' ;
def_class_statement  : KWS_STRUCT ID '{' def_statements '}' ;

statement : def_module_statement ';'
          | def_class_statement ';'
          ;
~~~

This grammar produces a reduce-reduce conflict. It is a serious error in the definition and must be avoided. Note that I deliberately included the surrounding context to show the problem. Two identical right-hand sides do not cause a reduce-reduce conflict on their own; the conflict arises only when the terminals that may follow them are also the same (here `';'`). Avoiding it is easy: merge the similar nonterminals into one.

Shift-reduce conflicts bring us to the dangling-else problem of if-else:

~~~
if_state : IF '(' expr ')' statement
         | IF '(' expr ')' statement ELSE statement
         ;

statement : if_state
          | ...
          ;
~~~

With this definition, once the first part of the if has been recognized and the next token is ELSE, the parser may either reduce or shift. Reducing is legal because if_state is itself a statement, and in the second alternative the statement after the condition may be followed by ELSE. By the algorithm, a reduction is valid here and so is a shift.

Bison resolves this conflict by preferring the shift, so ELSE binds to the nearest IF. A shift-reduce conflict can therefore be left alone when you know exactly where it comes from. An unexpected shift-reduce conflict, however, may make your parser behave incorrectly, so stay alert.

Here is my Bison file:

~~~
%{
#include "Model/nodes.h"
#include <list>
using namespace std;

#define YYERROR_VERBOSE 1

Node *programBlock; /* the top level root node of our final AST */

extern int yylex();
extern int yylineno;
extern char* yytext;
extern int yyleng;

void yyerror(const char *s);
%}

/* Represents the many different ways we can access our data */
%union {
    Node *nodes;
    char *str;
    int token;
}

/* Define our terminal symbols (tokens). This should
   match our tokens.l lex file. We also define the node type
   they represent.
 */
%token <str> ID INTEGER DOUBLE
%token <token> CEQ CNE CGE CLE MBK
%token <token> '<' '>' '=' '+' '-' '*' '/' '%' '^' '&' '|' '~' '@'
%token <str> STRING CHAR
%token <token> IF ELSE WHILE DO GOTO FOR FOREACH
%token <token> DELEGATE DEF DEFINE IMPORT USING NAMESPACE
%token <token> RETURN NEW THIS
%token <str> KWS_EXIT KWS_ERROR KWS_TSZ KWS_STRUCT KWS_FWKZ KWS_FUNC_XS KWS_TYPE

/*
   Define the type of node our nonterminal symbols represent.
   The types refer to the %union declaration above. Ex: when
   we call an ident (defined by union type ident) we are really
   calling an (NIdentifier*). It makes the compiler happy.
 */
%type <nodes> program
%type <nodes> def_module_statement
%type <nodes> def_module_statements
%type <nodes> def_statement
%type <nodes> def_statements
%type <nodes> for_state
%type <nodes> if_state
%type <nodes> while_state
%type <nodes> statement
%type <nodes> statements
%type <nodes> block
%type <nodes> var_def
%type <nodes> func_def
%type <nodes> func_def_args
%type <nodes> func_def_xs
%type <nodes> numeric
%type <nodes> expr
%type <nodes> call_arg
%type <nodes> call_args
%type <nodes> return_state

//%type <token> operator   this design easily causes reduce conflicts, so it was dropped

/* Operator precedence for mathematical operators */
%left '~'
%left '&' '|'
%left CEQ CNE CLE CGE '<' '>' '='
%left '+' '-'
%left '*' '/' '%' '^'
%left '.'
%left MBK '@'

%start program

%%

program
    : def_statements { programBlock = Node::getList($1); }
    ;

def_module_statement
    : KWS_STRUCT ID '{' def_statements '}' { $$ = Node::make_list(3, StringNode::Create($1), StringNode::Create($2), $4); }
    | KWS_STRUCT ID ';' { $$ = Node::make_list(3, StringNode::Create($1), StringNode::Create($2), Node::Create()); }
    ;

def_module_statements
    : def_module_statement { $$ = Node::getList($1); }
    | def_module_statements def_module_statement { $$ = $1; $$->addBrother(Node::getList($2)); }
    ;

func_def_xs
    : KWS_FUNC_XS { $$ = StringNode::Create($1); }
    | func_def_xs KWS_FUNC_XS { $$ = $1; $$->addBrother(StringNode::Create($2)); }
    ;

def_statement
    : var_def ';' { $$ = $1; }
    | func_def
    | def_module_statement
    | func_def_xs func_def { $$ = $2; $2->addBrother(Node::getList($1)); }
    ;

def_statements
    : def_statement { $$ = Node::getList($1); }
    | def_statements def_statement { $$ = $1; $$->addBrother(Node::getList($2)); }
    ;

statements
    : statement { $$ = Node::getList($1); }
    | statements statement { $$ = $1; $$->addBrother(Node::getList($2)); }
    ;

statement
    : def_statement
    | expr ';' { $$ = $1; }
    | block
    | if_state
    | while_state
    | for_state
    | return_state
    ;

if_state
    : IF '(' expr ')' statement { $$ = Node::make_list(3, StringNode::Create("if"), $3, $5); }
    | IF '(' expr ')' statement ELSE statement { $$ = Node::make_list(4, StringNode::Create("if"), $3, $5, $7); }
    ;

while_state
    : WHILE '(' expr ')' statement { $$ = Node::make_list(3, StringNode::Create("while"), $3, $5); }
    ;

for_state
    : FOR '(' expr ';' expr ';' expr ')' statement { $$ = Node::make_list(5, StringNode::Create("for"), $3, $5, $7, $9); }
    | FOR '(' var_def ';' expr ';' expr ')' statement { $$ = Node::make_list(5, StringNode::Create("for"), Node::Create($3), $5, $7, $9); }
    ;

return_state
    : RETURN ';' { $$ = StringNode::Create("return"); }
    | RETURN expr ';' { $$ = StringNode::Create("return"); $$->addBrother($2); }
    ;

block
    : '{' statements '}' { $$ = Node::Create($2); }
    | '{' '}' { $$ = Node::Create(); }
    ;

var_def
    : KWS_TYPE ID { $$ = Node::make_list(3, StringNode::Create("set"), StringNode::Create($1), StringNode::Create($2)); }
    | ID ID { $$ = Node::make_list(3, StringNode::Create("set"), StringNode::Create($1), StringNode::Create($2)); }
    | KWS_TYPE ID '=' expr { $$ = Node::make_list(4, StringNode::Create("set"), StringNode::Create($1), StringNode::Create($2), $4); }
    | ID ID '=' expr { $$ = Node::make_list(4, StringNode::Create("set"), StringNode::Create($1), StringNode::Create($2), $4); }
    ;

/* The count passed to make_list must equal the number of nodes that follow it;
   the two declaration-only alternatives pass 4 nodes, so the count is 4 */
func_def
    : ID ID '(' func_def_args ')' block { $$ = Node::make_list(5, StringNode::Create("function"), StringNode::Create($1), StringNode::Create($2), $4, $6); }
    | KWS_TYPE ID '(' func_def_args ')' block { $$ = Node::make_list(5, StringNode::Create("function"), StringNode::Create($1), StringNode::Create($2), $4, $6); }
    | ID ID '(' func_def_args ')' ';' { $$ = Node::make_list(4, StringNode::Create("function"), StringNode::Create($1), StringNode::Create($2), $4); }
    | KWS_TYPE ID '(' func_def_args ')' ';' { $$ = Node::make_list(4, StringNode::Create("function"), StringNode::Create($1), StringNode::Create($2), $4); }
    ;

func_def_args
    : var_def { $$ = Node::Create(Node::Create($1)); }
    | func_def_args ',' var_def { $$ = $1; $$->addChildren(Node::Create($3)); }
    | %empty { $$ = Node::Create(); }
    ;

numeric
    : INTEGER { $$ = IntNode::Create($1); }
    | DOUBLE { $$ = FloatNode::Create($1); }
    ;

expr
    : expr '=' expr { $$ = Node::make_list(4, StringNode::Create("opt2"), StringNode::Create("="), $1, $3); }
    | ID '(' call_args ')' { $$ = Node::make_list(2, StringNode::Create("call"), StringNode::Create($1)); $$->addBrother($3); }
    | ID { $$ = IDNode::Create($1); }
    | numeric { $$ = $1; }
    | STRING { $$ = StringNode::Create($1); }
    | KWS_TSZ /* default action only; Bison warns about a <nodes>/<str> type clash here */
    | NEW ID '(' call_args ')' { $$ = Node::make_list(3, StringNode::Create("new"), StringNode::Create($2), $4); }
    | expr CEQ expr { $$ = Node::make_list(4, StringNode::Create("opt2"), StringNode::Create("=="), $1, $3); }
    | expr CNE expr { $$ = Node::make_list(4, StringNode::Create("opt2"), StringNode::Create("!="), $1, $3); }
    | expr CLE expr { $$ = Node::make_list(4, StringNode::Create("opt2"), StringNode::Create("<="), $1, $3); }
    | expr CGE expr { $$ = Node::make_list(4, StringNode::Create("opt2"), StringNode::Create(">="), $1, $3); }
    | expr '<' expr { $$ = Node::make_list(4, StringNode::Create("opt2"), StringNode::Create("<"), $1, $3); }
    | expr '>' expr { $$ = Node::make_list(4, StringNode::Create("opt2"), StringNode::Create(">"), $1, $3); }
    | expr '+' expr { $$ = Node::make_list(4, StringNode::Create("opt2"), StringNode::Create("+"), $1, $3); }
    | expr '-' expr { $$ = Node::make_list(4, StringNode::Create("opt2"), StringNode::Create("-"), $1, $3); }
    | expr '*' expr { $$ = Node::make_list(4, StringNode::Create("opt2"), StringNode::Create("*"), $1, $3); }
    | expr '/' expr { $$ = Node::make_list(4, StringNode::Create("opt2"), StringNode::Create("/"), $1, $3); }
    | expr '%' expr { $$ = Node::make_list(4, StringNode::Create("opt2"), StringNode::Create("%"), $1, $3); }
    | expr '^' expr { $$ = Node::make_list(4, StringNode::Create("opt2"), StringNode::Create("^"), $1, $3); }
    | expr '&' expr { $$ = Node::make_list(4, StringNode::Create("opt2"), StringNode::Create("&"), $1, $3); }
    | expr '|' expr { $$ = Node::make_list(4, StringNode::Create("opt2"), StringNode::Create("|"), $1, $3); }
    | expr '.' expr { $$ = Node::make_list(4, StringNode::Create("opt2"), StringNode::Create("."), $1, $3); }
    | '~' expr { $$ = Node::make_list(3, StringNode::Create("opt1"), StringNode::Create("~"), $2); }
    | '(' expr ')' /* ( expr ) */ { $$ = $2; }
    ;

/* ID is a <str> token, so it is wrapped in a node before being passed to make_list */
call_arg
    : expr { $$ = $1; }
    | ID ':' expr { $$ = Node::make_list(3, StringNode::Create(":"), StringNode::Create($1), $3); }
    ;

call_args
    : %empty { $$ = Node::Create(); }
    | call_arg { $$ = Node::getList($1); }
    | call_args ',' call_arg { $$ = $1; $$->addBrother(Node::getList($3)); }
    ;

%%

void yyerror(const char* s){
    fprintf(stderr, "%s \n", s);
    fprintf(stderr, "line %d: ", yylineno);
    fprintf(stderr, "text %s \n", yytext);
    exit(1);
}
~~~
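To see that the block-comment pattern from the lexer really stops at the first `*/`, it can be tried outside Flex. The sketch below is an illustration under stated assumptions, not part of RedApple: it rewrites the Flex pattern in `std::regex` (ECMAScript) syntax, and `matchBlockComment` is a hypothetical helper name. Flex compiles its rules into a DFA, whereas `std::regex` may backtrack, so this is meant for short test inputs only.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>

// Returns the length of the C-style block comment at the very start of text,
// or 0 if text does not begin with a complete comment.
// The pattern mirrors the Flex rule  "/*"([^\*]|(\*)*[^\*/])*(\*)*"*/"
std::size_t matchBlockComment(const std::string& text) {
    static const std::regex pattern(R"(/\*([^*]|\**[^*/])*\**\*/)");
    std::smatch m;
    // match_continuous anchors the match at the first character, which is
    // how a lexer rule is applied at the current input position.
    if (std::regex_search(text, m, pattern,
                          std::regex_constants::match_continuous)) {
        return static_cast<std::size_t>(m.length(0));
    }
    return 0;
}
```

For the input `/* a */ b */` the helper reports 7, the length of `/* a */`: the comment body cannot contain the sequence `*/`, so the match cannot run on to the second terminator.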


(3) Generating Code with Code

Last updated: 2022-04-01 14:36:02

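### Generating code with code

The development model of LLVM is simple: you write C++ code that keeps emitting LLVM bitcode.

### The RedApple example language

RedApple is an experimental language that I spent a little over two weeks building, mainly to validate one compiler design idea: a macro translation system. Its architecture differs from that of a typical compiler. The front end first converts the source into very generic AST nodes. A typical compiler would run semantic analysis directly on those nodes and then generate code. I instead use a Lisp-like representation: the source file is converted into a syntax tree, the whole tree is traversed, and each part is translated according to the rule of the macro it is tagged with.

The whole compiler is about 1,500 lines of code. It is very compact, but its features are limited and many parts are unfinished. It mainly supports function definitions, structure definitions, function calls, structure member access, memory allocation and basic control-flow statements. You can treat it as an example for learning LLVM.

GitHub: [https://github.com/sunxfancy/RedApple](https://github.com/sunxfancy/RedApple)

I also recommend two excellent examples written by other authors:

Building a toy compiler with Flex, Bison and LLVM
[http://lesliezhu.github.io/public/write-your-toy-compiler.html](http://lesliezhu.github.io/public/write-your-toy-compiler.html)

The series "Developing your own compiler with LLVM"
[http://my.oschina.net/linlifeng/blog/97457](http://my.oschina.net/linlifeng/blog/97457)

You are not expected to understand these examples all at once; if you could, this tutorial would be pointless. In the following chapters I will keep introducing the key LLVM APIs and the related toolchain. Use the three projects above, together with the examples on the LLVM website, as references to check against while you practice.

### A brief look at the toolchain

| Tool | Purpose |
|-----|-----|
| clang -emit-llvm | Compiler option that makes clang emit a .bc bitcode file |
| lli | LLVM interpreter; runs a .bc bitcode file directly |
| llc | LLVM compiler; compiles .bc to native assembly or, with -filetype=obj, to a .o file |

These three are used most often. The other small tools are there when needed:

| Tool | Purpose |
|-----|-----|
| llvm-as | Assembler (textual .ll to .bc) |
| llvm-dis | Disassembler (.bc to textual .ll) |
| llvm-ar | Archiver |
| llvm-link | Bitcode linker |

There are many more, and quite a few I have never used myself. If you need others, see the official documentation:
[http://llvm.org/docs/CommandGuide/index.html](http://llvm.org/docs/CommandGuide/index.html)

### Commonly used classes

| LLVM class | Purpose |
|-----|-----|
| LLVMContext | Context class; the central object that owns the types, constants and other shared state |
| Module | Module class; usually one file is one module, holding a function list and a global variable table |
| Function | Function class; what it generates corresponds to a C function |
| Constant | Constant class; all kinds of constants derive from it |
| Value | Base class of all values; almost every function, constant, variable and expression can be treated as a Value |
| Type | Type class; represents built-in and user-defined types. Every Value has a getType method that returns its type |
| BasicBlock | Basic block; roughly a label. Blocks are not nested; each block ends with an instruction that may jump to other blocks, much like labels in C |

### A first small function

Let us practice with printf. The function is useful, it is simple, and it only needs a declaration, not a body.

~~~
void register_printf(llvm::Module *module) {
    std::vector<llvm::Type*> printf_arg_types; // the parameter list
    printf_arg_types.push_back(llvm::Type::getInt8PtrTy(module->getContext()));

    llvm::FunctionType* printf_type =
        llvm::FunctionType::get(
            llvm::Type::getInt32Ty(module->getContext()), printf_arg_types, true);
    // the final true means the function takes variadic arguments

    llvm::Function *func = llvm::Function::Create(
                printf_type, llvm::Function::ExternalLinkage,
                llvm::Twine("printf"),
                module
           );
    func->setCallingConv(llvm::CallingConv::C); // make sure the calling convention is correct
}
~~~

Not hard at all, is it?

### Writing the main function and the debugging setup

Next we write a main function to check that our function is correct. This is also the core startup and debugging workflow when working with LLVM.

~~~
int main(){
    InitializeNativeTarget();
    LLVMContext Context;
    Module* M = new Module("main", Context);

    register_printf(M);

    // Verification; this function needs an output stream to print error messages
    if (verifyModule(*M, &errs())) {
        errs() << "Failed to build LLVM bitcode!\n";
        exit(1);
    }

    // Print the LLVM IR
    outs() << "LLVM module:\n\n" << *M;
    outs() << "\n\n";
    outs().flush();

    // Write the binary bitcode to a .bc file
    std::error_code ErrInfo;
    raw_ostream *out = new raw_fd_ostream("a.bc", ErrInfo, sys::fs::F_None);
    WriteBitcodeToFile(M, *out);
    out->flush(); delete out;

    // Shut down LLVM and release memory
    llvm_shutdown();
    return 0;
}
~~~

The result:

![Output of the example program](https://docs.gechiui.com/gc-content/uploads/sites/kancloud/2016-06-03_5750ee1b5c9f8.jpg "")

We have not yet said which headers to include; see the appendix.

### Appendix: complete example

The header list is a bit long, and I no longer remember what each one is for. I usually just paste the whole batch in.

~~~
/*
* @Author: sxf
* @Date:   2015-11-06 20:37:15
* @Last Modified by:   sxf
* @Last Modified time: 2015-11-06 20:46:43
*/

#include "llvm/IR/Verifier.h"
#include "llvm/ExecutionEngine/GenericValue.h"
#include "llvm/ExecutionEngine/Interpreter.h"
#include "llvm/IR/Constants.h"
#include "llvm/IR/DerivedTypes.h"
#include "llvm/IR/Instructions.h"
#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/Module.h"
#include "llvm/IR/IRBuilder.h"
#include "llvm/Support/ManagedStatic.h"
#include "llvm/Support/TargetSelect.h"
#include "llvm/Support/raw_ostream.h"
#include "llvm/Bitcode/ReaderWriter.h"
#include "llvm/Support/FileSystem.h"
#include "llvm/IR/ValueSymbolTable.h"
#include <cstdlib>       // exit
#include <system_error>  // std::error_code
#include <vector>        // std::vector

using namespace llvm;

void register_printf(llvm::Module *module) {
    std::vector<llvm::Type*> printf_arg_types;
    printf_arg_types.push_back(llvm::Type::getInt8PtrTy(module->getContext()));

    llvm::FunctionType* printf_type =
        llvm::FunctionType::get(
            llvm::Type::getInt32Ty(module->getContext()), printf_arg_types, true);

    llvm::Function *func = llvm::Function::Create(
                printf_type, llvm::Function::ExternalLinkage,
                llvm::Twine("printf"),
                module
           );
    func->setCallingConv(llvm::CallingConv::C);
}

int main(){
    InitializeNativeTarget();
    LLVMContext Context;
    Module* M = new Module("main", Context);

    register_printf(M);

    // Verification; this function needs an output stream to print error messages
    if (verifyModule(*M, &errs())) {
        errs() << "Failed to build LLVM bitcode!\n";
        exit(1);
    }

    // Print the LLVM IR
    outs() << "LLVM module:\n\n" << *M;
    outs() << "\n\n";
    outs().flush();

    // Write the binary bitcode to a .bc file
    std::error_code ErrInfo;
    raw_ostream *out = new raw_fd_ostream("a.bc", ErrInfo, sys::fs::F_None);
    WriteBitcodeToFile(M, *out);
    out->flush(); delete out;

    // Shut down LLVM and release memory
    llvm_shutdown();
    return 0;
}
~~~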


(2) Developing an LLVM Project

Last updated: 2022-04-01 14:36:00

### Developing an LLVM project

After so much introduction, can we use LLVM to build a programming language of our own? The answer: it takes some effort, but it is entirely feasible. If you know C++ and have enough enthusiasm, nothing will stop you.

Below I describe the basic way to set up an LLVM project. What you need: the LLVM libraries, the documentation, CMake and a C++ compiler.

### Setting up the environment

My system is Ubuntu 14.04, so I describe the setup on Ubuntu; my apologies to Windows users.

Install llvm-3.6 and its development packages:

~~~
sudo apt-get install llvm-3.6*
~~~

I recommend downloading the documentation and examples as well, because references that match your exact version matter. Much of the code found online works only with a particular version, and the API has often changed since. So pay close attention to versions. When I was developing, 3.6 was the newest version in the package repository, and I was not chasing the latest release or new features. To be clear: this whole tutorial series uses LLVM 3.6 and refers to the 3.6 documentation.

~~~
sudo apt-get install clang cmake
~~~

I personally find the clang compiler far more pleasant to use than gcc. It uses LLVM as its back end, so it can compile C code to LLVM intermediate code for us, which gives us a correct reference to compare against.

### Managing the project with CMake

CMake is a very handy tool for managing C++ projects and spares us the tedium of writing Makefiles by hand. Here is a CMake example that also configures Flex and Bison:

~~~
cmake_minimum_required(VERSION 2.8)
project(RedApple)

set(LLVM_TARGETS_TO_BUILD X86)
set(LLVM_BUILD_RUNTIME OFF)
set(LLVM_BUILD_TOOLS OFF)

find_package(LLVM REQUIRED CONFIG)
message(STATUS "Found LLVM ${LLVM_PACKAGE_VERSION}")
message(STATUS "Using LLVMConfig.cmake in: ${LLVM_DIR}")

find_package(BISON)
find_package(FLEX)

SET (CMAKE_CXX_COMPILER_ENV_VAR "clang++")

SET (CMAKE_CXX_FLAGS "-std=c++11")
SET (CMAKE_CXX_FLAGS_DEBUG "-g")
SET (CMAKE_CXX_FLAGS_MINSIZEREL "-Os -DNDEBUG")
SET (CMAKE_CXX_FLAGS_RELEASE "-O4 -DNDEBUG")
SET (CMAKE_CXX_FLAGS_RELWITHDEBINFO "-O2 -g")

SET(EXECUTABLE_OUTPUT_PATH ${PROJECT_SOURCE_DIR}/bin)

include_directories(${LLVM_INCLUDE_DIRS})
add_definitions(${LLVM_DEFINITIONS})

FLEX_TARGET(MyScanner ${CMAKE_CURRENT_SOURCE_DIR}/src/redapple_lex.l
            ${CMAKE_CURRENT_BINARY_DIR}/redapple_lex.cpp COMPILE_FLAGS -w)
BISON_TARGET(MyParser ${CMAKE_CURRENT_SOURCE_DIR}/src/redapple_parser.y
             ${CMAKE_CURRENT_BINARY_DIR}/redapple_parser.cpp)
ADD_FLEX_BISON_DEPENDENCY(MyScanner MyParser)

include_directories(Debug Release build include src src/Model src/Utils)

file(GLOB_RECURSE source_files
     ${CMAKE_CURRENT_SOURCE_DIR}/src/*.cpp
     ${CMAKE_CURRENT_SOURCE_DIR}/src/Model/*.cpp
     ${CMAKE_CURRENT_SOURCE_DIR}/src/Macro/*.cpp
     ${CMAKE_CURRENT_SOURCE_DIR}/src/Utils/*.cpp)

add_executable(redapple ${source_files}
               ${BISON_MyParser_OUTPUTS}
               ${FLEX_MyScanner_OUTPUTS})

install(TARGETS redapple RUNTIME DESTINATION bin)

# Find the libraries that correspond to the LLVM components
# that we wish to use
llvm_map_components_to_libnames(llvm_libs
    support core irreader executionengine interpreter
    mc mcjit bitwriter x86codegen target)

# Link against LLVM libraries
target_link_libraries(redapple ${llvm_libs})
~~~

With the default Ubuntu installation, LLVM sometimes has a packaging bug: CMake cannot find many of the configuration files. I looked closely at its CMake configuration and found one line with a wrong script path.

/usr/share/llvm-3.6/cmake/ is LLVM's CMake configuration directory. Line 48 of the LLVMConfig.cmake file in it originally reads:

~~~
set(LLVM_CMAKE_DIR "/usr/share/llvm-3.6/share/llvm/cmake")
~~~

It should be changed to:

~~~
set(LLVM_CMAKE_DIR "/usr/share/llvm-3.6/cmake")
~~~

On Ubuntu the LLVM documentation and examples are in these directories:

/usr/share/doc/llvm-3.6-doc
/usr/share/doc/llvm-3.6-examples

Copy HowToUseJIT from the examples directory into your working directory and try to compile it; if you do not feel like hunting for it, paste the contents from the appendix below. Then do a test build with the slightly modified CMake file.

The project layout looks like this:

~~~
HowToUseJIT
 + --- src
        + --- HowToUseJIT.cpp
 + --- CMakeLists.txt
 + --- build
~~~

Run the following commands from the project root:

~~~
cd build
cmake ..
make
~~~

If it compiles, congratulations: you now know how to build an LLVM project.

### Appendix: CMakeLists.txt and HowToUseJIT.cpp

CMakeLists.txt

~~~
cmake_minimum_required(VERSION 2.8)
project(llvm_test)

set(LLVM_TARGETS_TO_BUILD X86)
set(LLVM_BUILD_RUNTIME OFF)
set(LLVM_BUILD_TOOLS OFF)

find_package(LLVM REQUIRED CONFIG)
message(STATUS "Found LLVM ${LLVM_PACKAGE_VERSION}")
message(STATUS "Using LLVMConfig.cmake in: ${LLVM_DIR}")

SET (CMAKE_CXX_COMPILER_ENV_VAR "clang++")

SET (CMAKE_CXX_FLAGS "-std=c++11")
SET (CMAKE_CXX_FLAGS_DEBUG "-g")
SET (CMAKE_CXX_FLAGS_MINSIZEREL "-Os -DNDEBUG")
SET (CMAKE_CXX_FLAGS_RELEASE "-O4 -DNDEBUG")
SET (CMAKE_CXX_FLAGS_RELWITHDEBINFO "-O2 -g")

SET(EXECUTABLE_OUTPUT_PATH ${PROJECT_SOURCE_DIR}/bin)

include_directories(${LLVM_INCLUDE_DIRS})
add_definitions(${LLVM_DEFINITIONS})

file(GLOB_RECURSE source_files "${CMAKE_CURRENT_SOURCE_DIR}/src/*.cpp")
add_executable(llvm_test ${source_files})
install(TARGETS llvm_test RUNTIME DESTINATION bin)

# Find the libraries that correspond to the LLVM components
# that we wish to use
llvm_map_components_to_libnames(llvm_libs
    Core ExecutionEngine Interpreter MC Support nativecodegen)

# Link against LLVM libraries
target_link_libraries(llvm_test ${llvm_libs})
~~~

HowToUseJIT.cpp

~~~
//===-- examples/HowToUseJIT/HowToUseJIT.cpp - An example use of the JIT --===//
//
//                     The LLVM Compiler Infrastructure
//
// This file is distributed under the University of Illinois Open Source
// License. See LICENSE.TXT for details.
//
//===----------------------------------------------------------------------===//
//
//  This small program provides an example of how to quickly build a small
//  module with two functions and execute it with the JIT.
//
// Goal:
//  The goal of this snippet is to create in the memory
//  the LLVM module consisting of two functions as follow:
//
// int add1(int x) {
//   return x+1;
// }
//
// int foo() {
//   return add1(10);
// }
//
// then compile the module via JIT, then execute the `foo'
// function and return result to a driver, i.e. to a "host program".
//
// Some remarks and questions:
//
// - could we invoke some code using noname functions too?
//   e.g. evaluate "foo()+foo()" without fears to introduce
//   conflict of temporary function name with some real
//   existing function name?
//
//===----------------------------------------------------------------------===//

#include "llvm/ExecutionEngine/GenericValue.h"
#include "llvm/ExecutionEngine/Interpreter.h"
#include "llvm/IR/Constants.h"
#include "llvm/IR/DerivedTypes.h"
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Instructions.h"
#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/Module.h"
#include "llvm/Support/ManagedStatic.h"
#include "llvm/Support/TargetSelect.h"
#include "llvm/Support/raw_ostream.h"

using namespace llvm;

int main() {

  InitializeNativeTarget();

  LLVMContext Context;

  // Create some module to put our function into it.
  std::unique_ptr<Module> Owner = make_unique<Module>("test", Context);
  Module *M = Owner.get();

  // Create the add1 function entry and insert this entry into module M.  The
  // function will have a return type of "int" and take an argument of "int".
  // The '0' terminates the list of argument types.
  Function *Add1F =
    cast<Function>(M->getOrInsertFunction("add1", Type::getInt32Ty(Context),
                                          Type::getInt32Ty(Context),
                                          (Type *)0));

  // Add a basic block to the function. As before, it automatically inserts
  // because of the last argument.
  BasicBlock *BB = BasicBlock::Create(Context, "EntryBlock", Add1F);

  // Create a basic block builder with default parameters.  The builder will
  // automatically append instructions to the basic block `BB'.
  IRBuilder<> builder(BB);

  // Get pointers to the constant `1'.
  Value *One = builder.getInt32(1);

  // Get pointers to the integer argument of the add1 function...
  assert(Add1F->arg_begin() != Add1F->arg_end()); // Make sure there's an arg
  Argument *ArgX = Add1F->arg_begin();  // Get the arg
  ArgX->setName("AnArg");            // Give it a nice symbolic name for fun.

  // Create the add instruction, inserting it into the end of BB.
  Value *Add = builder.CreateAdd(One, ArgX);

  // Create the return instruction and add it to the basic block
  builder.CreateRet(Add);

  // Now, function add1 is ready.

  // Now we're going to create function `foo', which returns an int and takes no
  // arguments.
  Function *FooF =
    cast<Function>(M->getOrInsertFunction("foo", Type::getInt32Ty(Context),
                                          (Type *)0));

  // Add a basic block to the FooF function.
  BB = BasicBlock::Create(Context, "EntryBlock", FooF);

  // Tell the basic block builder to attach itself to the new basic block
  builder.SetInsertPoint(BB);

  // Get pointer to the constant `10'.
  Value *Ten = builder.getInt32(10);

  // Pass Ten to the call to Add1F
  CallInst *Add1CallRes = builder.CreateCall(Add1F, Ten);
  Add1CallRes->setTailCall(true);

  // Create the return instruction and add it to the basic block.
  builder.CreateRet(Add1CallRes);

  // Now we create the JIT.
  ExecutionEngine* EE = EngineBuilder(std::move(Owner)).create();

  outs() << "We just constructed this LLVM module:\n\n" << *M;
  outs() << "\n\nRunning foo: ";
  outs().flush();

  // Call the `foo' function with no arguments:
  std::vector<GenericValue> noargs;
  GenericValue gv = EE->runFunction(FooF, noargs);

  // Import result of execution:
  outs() << "Result: " << gv.IntVal << "\n";
  delete EE;
  llvm_shutdown();
  return 0;
}
~~~


(1) Modern Compiler Architecture

Last updated: 2022-04-01 14:35:58

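### Modern compiler architecture

Compiler technology, the queen of computer science, has driven the field forward ever since it was born. The history of compilers is almost a miniature of the history of computing, and compiler architecture has gradually become more elegant and more modular.

When compiler architecture comes up, you may still be thinking of the five classic phases from a compiler course:

lexical analysis -> syntax analysis -> semantic analysis -> intermediate code optimization -> target code generation

A compiler is usually split into a front end and a back end: the front end processes the source code, and the back end generates the target code.

Software engineering, however, is a continual process of abstraction and layering. Solving problems layer by layer is one of its defining traits; layering makes the layers more independent of each other, and each can then do its own job better.

### Intermediate code optimization in LLVM

One hallmark of LLVM is its independent, complete and strictly specified intermediate representation. This intermediate code, LLVM bitcode, is the essence of LLVM's abstraction. The front end generates the intermediate code and the back end automatically runs all kinds of optimizations and analyses on it, so every compiler built with LLVM gets state-of-the-art back-end optimization.

![](https://docs.gechiui.com/gc-content/uploads/sites/kancloud/2016-06-03_5750ee19a439e.png)

Another hallmark of LLVM is its built-in JIT. This used to be hard to imagine: implementing a JIT in a compiler takes an enormous amount of work, because the code has to be translated on the fly while balancing execution speed against compilation time. With LLVM, the JIT is simply a by-product that you can use directly.

LLVM takes the intermediate-code phase to its limit. The LLVM toolchain can generate code for every back-end platform it supports, and it can also link together modules compiled by the front ends of different languages, so you can easily call C functions from your own language.

![](https://docs.gechiui.com/gc-content/uploads/sites/kancloud/2016-06-03_5750ee1b42a50.png "")

### Readable intermediate code

LLVM intermediate code is very readable and has many high-level constructs, such as types, structures and metadata, which makes it convenient to work with.

~~~
; Declare the string constant as a global constant.
@.str = private unnamed_addr constant [13 x i8] c"hello world\0A\00"

; External declaration of the puts function
declare i32 @puts(i8* nocapture) nounwind

; Definition of main function
define i32 @main() {   ; i32()*
  ; Convert [13 x i8]* to i8 *...
  %cast210 = getelementptr [13 x i8], [13 x i8]* @.str, i64 0, i64 0

  ; Call puts function to write out the string to stdout.
  call i32 @puts(i8* %cast210)
  ret i32 0
}

; Named metadata
!0 = !{i32 42, null, !"string"}
!foo = !{!0}
~~~

This is a Hello World program in LLVM IR, the textual form of the bitcode. It is clear to read, and almost every position is annotated with a type. That underlines the fact that LLVM is strongly typed: every variable and temporary value must have an explicit type.

Here is a structure declaration:

~~~
%mytype = type { %mytype*, i32 }
~~~

Unfortunately, a structure definition contains only the sequence of member types, not the member names. The compiler front end has to store the names itself and look them up in its own table.

Calling a C function is easy; all it takes is a declaration:

~~~
declare i32 @printf(i8* noalias nocapture, ...)
declare i32 @atoi(i8 zeroext)
~~~

You can compile your source to .bc with LLVM, compile that to .o with llc, and then let Clang link it against the libraries you need.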
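Because an LLVM structure type records only the ordered member types, a front end needs its own table that maps member names to positions. The sketch below shows one way such a table might look. It is only an illustration: `StructLayout`, `addField`, `indexOf` and `body` are hypothetical names, types are kept as plain strings, and nothing here calls the LLVM API.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical front-end bookkeeping for one structure type: LLVM keeps the
// ordered member types, and this table keeps the member names.
class StructLayout {
public:
    // Register a member and return its zero-based index within the struct.
    unsigned addField(const std::string& name, const std::string& llvmType) {
        if (index_.count(name) != 0)
            throw std::invalid_argument("duplicate field: " + name);
        unsigned idx = static_cast<unsigned>(types_.size());
        index_[name] = idx;
        types_.push_back(llvmType);
        return idx;
    }

    // Look up the index a member access would use to address this field.
    unsigned indexOf(const std::string& name) const {
        auto it = index_.find(name);
        if (it == index_.end())
            throw std::out_of_range("unknown field: " + name);
        return it->second;
    }

    // Render the type body in the textual style shown above,
    // for example "{ %mytype*, i32 }".
    std::string body() const {
        std::string out = "{ ";
        for (std::size_t i = 0; i < types_.size(); ++i) {
            if (i != 0) out += ", ";
            out += types_[i];
        }
        out += " }";
        return out;
    }

private:
    std::unordered_map<std::string, unsigned> index_;  // name -> position
    std::vector<std::string> types_;                   // position -> type
};
```

With two members registered in order, say a self-pointer named `next` and an integer named `value` (names chosen only for this example), `body()` yields `{ %mytype*, i32 }` and `indexOf("value")` yields 1, the position a member access on `value` would have to use.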


Preface

Last updated: 2022-04-01 14:35:56

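> Original source: [LLVM, King of Compiler Architectures](http://blog.csdn.net/column/details/xf-llvm.html)
> Author: [xfxyy_sxfancy](http://blog.csdn.net/xfxyy_sxfancy)

**This series is compiled and published on Kancloud with the author's permission. Please do not reproduce it without the author's consent!**

# LLVM, King of Compiler Architectures

> In just a few years the LLVM platform has changed the direction of many programming languages and given rise to a large number of distinctive new ones. It deserves the title of king of compiler architectures, and it received the 2012 ACM Software System Award. This series walks through the core workflow of building a compiler with LLVM and shares practical development experience.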
