Lexical Scanner Case Study
- 2018-10-24
- Liu, An-Chi 劉安齊
I found it becomes more and more difficult to programing some and record the process on this series. The reason is that it costs huge time for studying and thinking about the next steps. I expected I could write some code every day, but it seems that I would take two or three days for studying on a topic and then implementing functions.
So, let’s talked about my studying today. I spend some time studying the implementation of TiDB that I talked about yesterday. I have some reflection on its methods.
It uses “trie” to identify tokens. Trie is a data structure. It is good at searching words in a dictionary. So, in this case, TiDB
searches tokens in a symbols dictionary.
I am thinking why it uses trie, but I still don’t understand why. As far as I am concerned, I would rather use hash. The searching complexity of trie and hash are both O(N). In this case, we only need to look up tokens, so no need to considering inserting and deleting data. In the condition that with a large amount of words, trie would use less space. However, in our case, symbols are only about a hundred, and adopting hash is much easy and straightforward for me.
No matter MySQL
or TiDB
, they both use Yacc
, and their scanners are just the interface for Yacc
. I would not use Yacc
, on one hand it is written in C++ (I want StellarSQL is pure Rust. Though in fact, Yacc
is run to create codes before the runtime of the program), and the other hand I would like to doing any parts by myself (less third party modules as possible), so the term, “from scratch”, would make sense.
Further, considering the performance, many DBMS implement their own bytes or strings handler for the scanner. I have not designed the protocol of the message yet, and I am just thinking about converting the string to bytes as protocol to make it simple. Anyway, handling bytes by well designed protocol would optimize the performance, but I would just process strings and using the Rust standard library for now. Maybe I will rewrite and refactor this part in the future, but now it’s fine to “keep it simple, keep it stupid”.