Implementing fast queries for local files in Go
Closure as a Bridge
Listing 2 shows the indexer, which uses the Walk() function from the standard path/filepath package to traverse a file hierarchy, beginning at the start directory that the user specifies on the command line. Arguments passed to the program are found in the os.Args slice; as in C, the program name is the first element, and the call parameters follow.
Listing 2
index.go
01 package main
02
03 import (
04   "database/sql"
05   _ "github.com/mattn/go-sqlite3"
06   "os"
07   "path/filepath"
08 )
09
10 type Walker struct {
11   Db *sql.DB
12 }
13
14 func main() {
15   if len(os.Args) != 2 {
16     panic("usage: " + os.Args[0] +
17       " start_dir")
18   }
19   root := os.Args[1]
20
21   db, err :=
22     sql.Open("sqlite3", "./files.db")
23   checkErr(err)
24   w := &Walker{
25     Db: db,
26   }
27
28   err = filepath.Walk(root, w.Visit)
29   checkErr(err)
30
31   db.Close()
32 }
33
34 func (w *Walker) Visit(path string,
35   f os.FileInfo, err error) error {
36   stmt, err := w.Db.Prepare(
37     "INSERT INTO files VALUES(?,?,?)")
38   checkErr(err)
39
40   _, err = stmt.Exec(
41     path, f.ModTime().Unix(), f.Size())
42   checkErr(err)
43
44   return nil
45 }
46
47 func checkErr(err error) {
48   if err != nil {
49     panic(err)
50   }
51 }
Traversing a file tree isn't rocket science: filepath.Walk() in line 28 calls the Visit() callback function for every entry it finds. The problem is that the callback starting in line 34 needs a database handle to record each entry, but Walk() offers no way to pass one in. The solution to this dilemma is to make Visit() behave like a closure that carries the handle along with it.
To do this, Visit() in line 34 defines a so-called receiver between the func keyword and the function name, thus telling Go to attach the function to the Walker data structure (line 10), which contains the database handle. The expression w.Visit in line 28 then binds the method to the w instance, so Visit() can access the handle via the w receiver variable. With the handle, it inserts new records into the database.
The actual work of setting up a database query is done by the Prepare() method, which prepares an SQL command and returns a statement handle. Line 40 then calls the Exec() method on this handle and passes in the values to be stored by the SQL command: the path to the file, its last modification timestamp, and its size.
To avoid having to check after each function call whether the error variable err has a value of nil, meaning that everything is OK, line 47 defines a function named checkErr(), which does this and aborts the program with panic() if something unforeseen happens.
Finders, Keepers
After the indexer finished its work, the files database table on my computer had more than a million entries, as shown in Figure 3. The high number of files was probably due to numerous cloned Git repositories and more than 20 years of Snapshot articles. With this data in the files.db SQLite database, a SQLite client can now quickly fire off queries and determine, for example, which files in my home directory have recently changed.
To do this, Listing 3 connects to the SQLite database and issues a SELECT command that queries all rows in the table, sorts them in descending order of the timestamp in the modified column, and then outputs the first 10 matches.
Listing 3
latest.go
01 package main
02
03 import (
04   "database/sql"
05   "fmt"
06   _ "github.com/mattn/go-sqlite3"
07 )
08
09 func main() {
10   db, err :=
11     sql.Open("sqlite3", "./files.db")
12   checkErr(err)
13
14   rows, err := db.Query("SELECT path, " +
15     "modified FROM files " +
16     "ORDER BY modified DESC LIMIT 10")
17   checkErr(err)
18
19   var path string
20   var mtime string
21
22   for rows.Next() {
23     err = rows.Scan(&path, &mtime)
24     checkErr(err)
25     fmt.Printf("%s %s\n", path, mtime)
26   }
27 }
28
29 func checkErr(err error) {
30   if err != nil {
31     panic(err)
32   }
33 }
The rows.Next() call in line 22 works its way step-by-step through the matches, and rows.Scan() retrieves the two selected column values of each match and assigns them to the path and mtime variables passed in as pointers; both of these were previously declared as strings. Go supports pointers, but it does not leave memory management up to the user and does not go up in smoke like C if an address is wrong because of a bug; instead, it quits with helpful error messages.
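Because the indexer stores f.ModTime().Unix() in the modified column, Listing 3 prints raw epoch seconds. A small sketch of how such a value could be turned into a readable date follows; formatMtime() is a hypothetical helper, and it assumes the column is scanned into an int64 rather than a string.

```go
package main

import (
	"fmt"
	"time"
)

// formatMtime converts Unix epoch seconds, as stored by the indexer,
// into an RFC 3339 timestamp in UTC.
func formatMtime(sec int64) string {
	return time.Unix(sec, 0).UTC().Format(time.RFC3339)
}

func main() {
	fmt.Println(formatMtime(0))     // 1970-01-01T00:00:00Z
	fmt.Println(formatMtime(86400)) // 1970-01-02T00:00:00Z
}
```

Dropping the UTC() call would render the timestamp in the local time zone instead.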
Which files in my home directory take up the most space? Listing 4 finds this out quickly by sorting all entries in descending order of size (ORDER BY size DESC) in the SELECT query from line 25 and LIMITing the output to a maximum number of matches. The user defines this number with the --max-files parameter at the command line, and Go provides a convenient interface for parsing command-line parameters with the flag package.
Listing 4
max-size.go
01 package main
02
03 import (
04   "database/sql"
05   "fmt"
06   "flag"
07   "os"
08   "strconv"
09   _ "github.com/mattn/go-sqlite3"
10 )
11
12 func main() {
13   db, err :=
14     sql.Open("sqlite3", "./files.db")
15   checkErr(err)
16
17   max_files := flag.Int("max-files", 10,
18     "max number of files")
19
20   flag.Parse()
21   if len(flag.Args()) != 0 {
22     panic("usage: " + os.Args[0])
23   }
24
25   rows, err := db.Query("SELECT path," +
26     "size FROM files " +
27     "ORDER BY size DESC LIMIT " +
28     strconv.Itoa(*max_files))
29   checkErr(err)
30
31   var path string
32   var size string
33
34   for rows.Next() {
35     err = rows.Scan(&path, &size)
36     checkErr(err)
37     fmt.Printf("%s %s\n", path, size)
38   }
39 }
40
41 func checkErr(err error) {
42   if err != nil {
43     panic(err)
44   }
45 }
The flag package first expects a declaration of the variable that will hold the value passed in from the command line (max_files in line 17). The call to the flag.Int() function specifies that only integers are accepted as values and sets the default to 10. Then flag.Parse() (line 20) analyzes the command-line parameters and, if the user has set --max-files, assigns this value to the variable that the max_files pointer references.
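The same parsing logic can be exercised in isolation with flag.NewFlagSet(), which works on an arbitrary slice of arguments instead of the global os.Args. The following sketch is one possible arrangement rather than the article's code; parseMaxFiles() is a hypothetical helper, and the flag name and default are taken from Listing 4.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// parseMaxFiles parses args (without the program name) and returns the
// value of --max-files, falling back to 10 as in Listing 4.
func parseMaxFiles(args []string) (int, error) {
	fs := flag.NewFlagSet("max-size", flag.ContinueOnError)
	maxFiles := fs.Int("max-files", 10, "max number of files")

	// With ContinueOnError, Parse returns the error instead of exiting.
	if err := fs.Parse(args); err != nil {
		return 0, err
	}
	// Like Listing 4, reject any leftover positional arguments.
	if fs.NArg() != 0 {
		return 0, fmt.Errorf("unexpected arguments: %v", fs.Args())
	}
	return *maxFiles, nil
}

func main() {
	n, err := parseMaxFiles(os.Args[1:])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(2)
	}
	fmt.Println("max files:", n)
}
```

Called without arguments, the program reports the default of 10; a value such as --max-files=abc is rejected by the flag package before the program logic ever sees it.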
The Itoa() function from the strconv package converts the integer behind the dereferenced *max_files pointer back into a string, and line 28 injects it into the SQL command as a LIMIT clause. The advantage of this conversion is that an integer actually ends up in the query and not an arbitrary character string that could be abused for SQL injection attacks.
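An integer is safe from injection, but it can still be out of range: SQLite treats a negative LIMIT as "no upper bound", so --max-files=-1 would return every row in the table. The sketch below adds a range check before the query is assembled; buildQuery() is a hypothetical helper and not part of the article's listings. Binding the value with a ? placeholder, as Listing 2 does for INSERT, should be an alternative to string concatenation, although that variant is not shown here.

```go
package main

import (
	"fmt"
	"strconv"
)

// buildQuery is a hypothetical helper that assembles the SELECT from
// Listing 4 after checking that the limit is a positive number.
func buildQuery(maxFiles int) (string, error) {
	if maxFiles < 1 {
		return "", fmt.Errorf(
			"max-files must be positive, got %d", maxFiles)
	}
	return "SELECT path, size FROM files " +
		"ORDER BY size DESC LIMIT " +
		strconv.Itoa(maxFiles), nil
}

func main() {
	query, err := buildQuery(10)
	if err != nil {
		panic(err)
	}
	fmt.Println(query)
}
```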
In comparison, Listing 5 shows that a database client in a scripting language like Python is easier to program. Because Python also comes with a SQLite driver, Listing 5 can use the database created by Go earlier without further ado. It digs out all database entries whose file paths match a given pattern. It expects a search string at the command line, wraps it in % wildcards, passes it to an SQL LIKE query as a bound parameter, and outputs the matches.
Listing 5
like.py
01 #!/usr/bin/env python3
02 import sys
03 import sqlite3
04
05 try:
06     _, pattern = sys.argv
07 except:
08     raise SystemExit(
09         "usage: " + sys.argv[0] + " pattern")
10
11 conn = sqlite3.connect('files.db')
12 c = conn.cursor()
13 like = "%" + pattern + "%"
14 for row in c.execute('SELECT path,size FROM files WHERE path LIKE ?', [like]):
15     print(row)
More Luxury, More Lines
Go's type checking and the fact that it does not run inside a bytecode interpreter, but as a compiled binary with more elegant memory management than a C or C++ program, have their price: Go requires more detailed instructions and generally more lines of code. Go programs run faster than Python scripts, but, as is so often the case, the bottleneck in the use case at hand is not in processing instructions but in communicating with external systems. In this case, database calls consume most of the run time. Whether the program code itself runs 10 or 100 percent faster is largely irrelevant.
However, the compact binary format with embedded libraries and no dependency worries is a big advantage, and probably one of the reasons Go has become the first choice for all types of system programming tasks.
Infos
- Google Code Search: https://github.com/google/codesearch
- Russ Cox, "Regular Expression Matching with a Trigram Index," 2012: https://swtch.com/~rsc/regexp/regexp4.html
- Listings for this article: ftp://ftp.linux-magazine.com/pub/listings/linux-magazine.com/215/