Lexical Scanner Implementation (3)
- 2018-10-24
- Liu, An-Chi 劉安齊
I found a cool project – TiDB, which has more than 15k stars.
TiDB is an open-source distributed scalable Hybrid Transactional and Analytical Processing (HTAP) database. It features infinite horizontal scalability, strong consistency, and high availability. TiDB is MySQL compatible and serves as a one-stop data warehouse for both OLTP (Online Transactional Processing) and OLAP (Online Analytical Processing) workloads. – From Github
I found this project because an article “How TiDB SQL Parser implement (TiDB SQL Parser 的实现)”(Written in Chinese). The article introduces the parser of TiDB, and it’s quite helpful for me. The parser source code is at pingcap/parser. I refer partly from lexer.go
, which implement the scanner, and misc.go
, which implements the token identifier with trie.
Let’s see some snippets from TiDB
:
!FILENAME pingcap/parser/lexer.go
// Scanner implements the yyLexer interface.
type Scanner struct {
r reader
buf bytes.Buffer
errs []error
stmtStartPos int
// For scanning such kind of comment: /*! MySQL-specific code */ or /*+ optimizer hint */
specialComment specialCommentScanner
sqlMode mysql.SQLMode
}
!FILENAME pingcap/parser/misc.go
func (s *Scanner) isTokenIdentifier(lit string, offset int) int {
// An identifier before or after '.' means it is part of a qualified identifier.
// We do not parse it as keyword.
if s.r.peek() == '.' {
return 0
}
if offset > 0 && s.r.s[offset-1] == '.' {
return 0
}
buf := &s.buf
buf.Reset()
buf.Grow(len(lit))
data := buf.Bytes()[:len(lit)]
for i := 0; i < len(lit); i++ {
if lit[i] >= 'a' && lit[i] <= 'z' {
data[i] = lit[i] + 'A' - 'a'
} else {
data[i] = lit[i]
}
}
checkBtFuncToken, tokenStr := false, string(data)
if s.r.peek() == '(' {
checkBtFuncToken = true
} else if s.sqlMode.HasIgnoreSpaceMode() {
s.skipWhitespace()
if s.r.peek() == '(' {
checkBtFuncToken = true
}
}
if checkBtFuncToken {
if tok := btFuncTokenMap[tokenStr]; tok != 0 {
return tok
}
}
tok := tokenMap[tokenStr]
return tok
}
It is interesting and enlightening to read these code, but not just copy-and-parse. TiDB
use Yacc
to find the hierarchical structure of the program. The Scanner
structure is following the interface of Yacc
. However, I would implement all by myself without any other tools.
I spend too much time reading TiDB
source code, so I just program a little bit today. I think it’s fine that I show the code tomorrow, and it would be more complete.