* Fix bug on migration 111 * Upgrade bleve to 1.0.10 Co-authored-by: zeripath <art27@cantab.net> Co-authored-by: techknowlogick <techknowlogick@gitea.io>
8.4 KiB
Vendored
ZAP File Format
Legend
Sections
|========|
| | section
|========|
Fixed-size fields
|--------| |----| |--| |-|
| | uint64 | | uint32 | | uint16 | | uint8
|--------| |----| |--| |-|
Varints
|~~~~~~~~|
| | varint(up to uint64)
|~~~~~~~~|
Arbitrary-length fields
|--------...---|
| | arbitrary-length field (string, vellum, roaring bitmap)
|--------...---|
Chunked data
[--------]
[ ]
[--------]
Overview
Footer section describes the configuration of particular ZAP file. The format of footer is version-dependent, so it is necessary to check V
field before the parsing.
|==================================================|
| Stored Fields |
|==================================================|
|-----> | Stored Fields Index |
| |==================================================|
| | Dictionaries + Postings + DocValues |
| |==================================================|
| |---> | DocValues Index |
| | |==================================================|
| | | Fields |
| | |==================================================|
| | |-> | Fields Index |
| | | |========|========|========|========|====|====|====|
| | | | D# | SF | F | FDV | CF | V | CC | (Footer)
| | | |========|====|===|====|===|====|===|====|====|====|
| | | | | |
|-+-+-----------------| | |
| |--------------------------| |
|-------------------------------------|
D#. Number of Docs.
SF. Stored Fields Index Offset.
F. Field Index Offset.
FDV. Field DocValue Offset.
CF. Chunk Factor.
V. Version.
CC. CRC32.
Stored Fields
Stored Fields Index is D#
consecutive 64-bit unsigned integers - offsets, where relevant Stored Fields Data records are located.
0 [SF] [SF + D# * 8]
| Stored Fields | Stored Fields Index |
|================================|==================================|
| | |
| |--------------------| ||--------|--------|. . .|--------||
| |-> | Stored Fields Data | || 0 | 1 | | D# - 1 ||
| | |--------------------| ||--------|----|---|. . .|--------||
| | | | |
|===|============================|==============|===================|
| |
|-------------------------------------------|
Stored Fields Data is an arbitrary size record, which consists of metadata and Snappy-compressed data.
Stored Fields Data
|~~~~~~~~|~~~~~~~~|~~~~~~~~...~~~~~~~~|~~~~~~~~...~~~~~~~~|
| MDS | CDS | MD | CD |
|~~~~~~~~|~~~~~~~~|~~~~~~~~...~~~~~~~~|~~~~~~~~...~~~~~~~~|
MDS. Metadata size.
CDS. Compressed data size.
MD. Metadata.
CD. Snappy-compressed data.
Fields
Fields Index section located between addresses F
and len(file) - len(footer)
and consist of uint64
values (F1
, F2
, ...) which are offsets to records in Fields section. We have F# = (len(file) - len(footer) - F) / sizeof(uint64)
fields.
(...) [F] [F + F#]
| Fields | Fields Index. |
|================================|================================|
| | |
| |~~~~~~~~|~~~~~~~~|---...---|||--------|--------|...|--------||
||->| Dict | Length | Name ||| 0 | 1 | | F# - 1 ||
|| |~~~~~~~~|~~~~~~~~|---...---|||--------|----|---|...|--------||
|| | | |
||===============================|==============|=================|
| |
|----------------------------------------------|
Dictionaries + Postings
Each of fields has its own dictionary, encoded in Vellum format. Dictionary consists of pairs (term, offset)
, where offset
indicates the position of postings (list of documents) for this particular term.
|================================================================|- Dictionaries +
| | Postings +
| | DocValues
| Freq/Norm (chunked) |
| [~~~~~~|~~~~~~~~~~~~~~~~~~~~~~~~~~~~~] |
| |->[ Freq | Norm (float32 under varint) ] |
| | [~~~~~~|~~~~~~~~~~~~~~~~~~~~~~~~~~~~~] |
| | |
| |------------------------------------------------------------| |
| Location Details (chunked) | |
| [~~~~~~|~~~~~|~~~~~~~|~~~~~|~~~~~~|~~~~~~~~|~~~~~] | |
| |->[ Size | Pos | Start | End | Arr# | ArrPos | ... ] | |
| | [~~~~~~|~~~~~|~~~~~~~|~~~~~|~~~~~~|~~~~~~~~|~~~~~] | |
| | | |
| |----------------------| | |
| Postings List | | |
| |~~~~~~~~|~~~~~|~~|~~~~~~~~|-----------...--| | |
| |->| F/N | LD | Length | ROARING BITMAP | | |
| | |~~~~~|~~|~~~~~~~~|~~~~~~~~|-----------...--| | |
| | |----------------------------------------------| |
| |--------------------------------------| |
| Dictionary | |
| |~~~~~~~~|--------------------------|-...-| |
| |->| Length | VELLUM DATA : (TERM -> OFFSET) | |
| | |~~~~~~~~|----------------------------...-| |
| | |
|======|=========================================================|- DocValues Index
| | |
|======|=========================================================|- Fields
| | |
| |~~~~|~~~|~~~~~~~~|---...---| |
| | Dict | Length | Name | |
| |~~~~~~~~|~~~~~~~~|---...---| |
| |
|================================================================|
DocValues
DocValues Index is F#
pairs of varints, one pair per field. Each pair of varints indicates start and end point of DocValues slice.
|================================================================|
| |------...--| |
| |->| DocValues |<-| |
| | |------...--| | |
|==|=================|===========================================|- DocValues Index
||~|~~~~~~~~~|~~~~~~~|~~| |~~~~~~~~~~~~~~|~~~~~~~~~~~~||
|| DV1 START | DV1 STOP | . . . . . | DV(F#) START | DV(F#) END ||
||~~~~~~~~~~~|~~~~~~~~~~| |~~~~~~~~~~~~~~|~~~~~~~~~~~~||
|================================================================|
DocValues is chunked Snappy-compressed values for each document and field.
[~~~~~~~~~~~~~~~|~~~~~~|~~~~~~~~~|-...-|~~~~~~|~~~~~~~~~|--------------------...-]
[ Doc# in Chunk | Doc1 | Offset1 | ... | DocN | OffsetN | SNAPPY COMPRESSED DATA ]
[~~~~~~~~~~~~~~~|~~~~~~|~~~~~~~~~|-...-|~~~~~~|~~~~~~~~~|--------------------...-]
Last 16 bytes are description of chunks.
|~~~~~~~~~~~~...~|----------------|----------------|
| Chunk Sizes | Chunk Size Arr | Chunk# |
|~~~~~~~~~~~~...~|----------------|----------------|