Skip to content

suconghou/log2duckdb

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

21 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

parse nginx access log into duckdb database

解析的日志格式

    log_format vhosts '$remote_addr - $remote_user [$time_local] "$request" '
                      '$status $body_bytes_sent "$http_referer" '
                      '"$http_user_agent" "$http_x_forwarded_for" '
                      '$host $request_length $bytes_sent $upstream_addr '
                      '$upstream_status $request_time $upstream_response_time '
                      '$upstream_connect_time $upstream_header_time';


日志文件解析及相关逻辑复用log2sqlite

main.cpp
parser.cpp
process.cpp

与原版 https://github.com/suconghou/log2sqlite 一致

仅db.cpp重新开发

编译

使用ubuntu24编译

提取duckdb静态库

下载官方 libduckdb-linux-amd64.zip , 解压提取duckdb.hpp文件到本项目

下载官方 static-libs-linux-amd64.zip 文件并解压到本项目

编译本程序

静态编译

g++ -Wall -std=c++20 -flto=auto -static-libstdc++ -static-libgcc --static \
-Wl,-Bstatic,--gc-sections,--start-group \
-O3 -march=x86-64-v3 -mtune=generic -funroll-loops \
-ffunction-sections -fdata-sections -s \
main.cpp \
libduckdb_static.a \
libduckdb_generated_extension_loader.a \
libautocomplete_extension.a \
libcore_functions_extension.a \
libicu_extension.a \
libjemalloc_extension.a \
libjson_extension.a \
libparquet_extension.a \
libtpcds_extension.a \
libtpch_extension.a \
libduckdb_fastpforlib.a \
libduckdb_fmt.a \
libduckdb_fsst.a \
libduckdb_hyperloglog.a \
libduckdb_mbedtls.a \
libduckdb_miniz.a \
libduckdb_pg_query.a \
libduckdb_re2.a \
libduckdb_skiplistlib.a \
libduckdb_utf8proc.a \
libduckdb_yyjson.a \
libduckdb_zstd.a \
-Wl,--end-group \
-o log2duckdb

测试

log2duckdb nginx.log 生成db文件

log2duckdb ngx_log.db 'query sql' 查询

与c++版本log2sqlite相比,插入速度基本相同,每秒插入10W+

测试数据700MB+, 约200万行

版本 时间
c++ 10.105s

sqlite版本见 https://github.com/suconghou/log2sqlite

但是数据库文件相比sqlite小,是其五分之一大小

查询速度比sqlite快很多

'SELECT count(1) as n,request FROM nginx_log GROUP BY request ORDER BY n desc LIMIT 50;' 比sqlite快12倍

'SELECT count(1) as n,remote_addr FROM nginx_log GROUP BY remote_addr ORDER BY n desc LIMIT 50;' 比sqlite快13倍

'SELECT count(1) as n,http_user_agent FROM nginx_log GROUP BY http_user_agent ORDER BY n desc LIMIT 50;' 比sqlite快28倍

'SELECT count(1) as n,request,http_user_agent FROM nginx_log GROUP BY http_user_agent,request ORDER BY n desc LIMIT 50;' 比sqlite快11倍

'SELECT count(1) as n,request,remote_addr FROM nginx_log GROUP BY remote_addr,request ORDER BY n desc LIMIT 50;' 比sqlite快12倍

'SELECT count(1) as n,request,round(avg(request_time),2) FROM nginx_log WHERE request_time > 2 GROUP BY request ORDER BY n desc LIMIT 50;' 比sqlite快6倍

"SELECT request FROM nginx_log WHERE LENGTH(request) - LENGTH(REPLACE(request, ' ', '')) <> 2;" 比sqlite快3倍

log2duckdb nginx.log # 生成db文件
log2duckdb ngx_log.db 'query sql' # 查询