JSONLines
A simple package to read (parts of) a JSON Lines file. The main purpose is to read files that are larger than memory. The two main functions are LineIndex and LineIterator, which return an index of the rows in the given file and an iterator over the file, respectively. The LineIndex is Tables.jl compatible and can be piped directly into, e.g., a DataFrame if every row in the result has the same schema (i.e. the same variables). See also materialize and columnwise.
LineIndex allows memory-efficient loading of rows of a JSON Lines file. To select rows, use the keyword arguments skip and nrows: nrows rows are indexed after skipping skip rows. The file is mmapped and only the required rows are loaded into RAM.
Each line of the file must contain a valid JSON object (denoted by {"String1":ELEMENT1, "String2":ELEMENT2, ...}). JSON parsing is done using the JSON3.jl package. Lines can be separated by \n or \r\n, and some whitespace characters (basically all that can be represented as a single UInt8) are allowed at the beginning of a line before the JSON object and before the newline character. Typically a file would look like this:
{"name":"Daniel","organization":"IMSM"}
{"name":"Peter","organization":"StatMath"}
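As a rough illustration of the workflow described above, the following sketch writes the two sample rows to a temporary file and indexes them. The temporary path and the variable names are made up for this example. Names are module-qualified because the sketch does not assume which of them are exported, and schemafrom is narrowed to the rows that actually exist in this tiny file.

```julia
import JSONLines

# Hypothetical sample file holding the two rows shown above.
path = joinpath(mktempdir(), "people.jsonl")
open(path, "w") do io
    println(io, """{"name":"Daniel","organization":"IMSM"}""")
    println(io, """{"name":"Peter","organization":"StatMath"}""")
end

# Index all rows; schemafrom is limited to the two rows the file contains.
lines = JSONLines.LineIndex(path; schemafrom = 1:2)

# Skip the first row, then index a single row.
second = JSONLines.LineIndex(path; skip = 1, nrows = 1, schemafrom = 1:1)

# Parse the indexed rows row-wise (similar to Tables.rowtable) ...
rows = JSONLines.materialize(lines)
# ... or column-wise (similar to Tables.columntable).
cols = JSONLines.columnwise(second)
```

If every row shares the same schema, the index can also be handed to a Tables.jl sink, for example lines |> DataFrame with DataFrames.jl loaded.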
There is experimental support for JSON Arrays on each line where the first line after skip contains the names of the columns.
["name", "organization"]
["Daniel", "IMSM"]
["Peter", "StatMath"]
This should work but is not tested thoroughly.
Getting Started
(@v1.5) pkg> add JSONLines
Functions
JSONLines.LineIndex
JSONLines.LineIterator
Base.filter
Base.findall
Base.findfirst
Base.findlast
Base.findnext
Base.findprev
JSONLines.columnnames
JSONLines.columntypes
JSONLines.columnwise
JSONLines.gettypes
JSONLines.materialize
JSONLines.materialize
JSONLines.select
JSONLines.settype!
JSONLines.settypes!
JSONLines.settypes!
JSONLines.writelines
JSONLines.@MStructType
JSONLines.LineIndex — Method
LineIndex(path::String; filestart::Int = 0, skip::Int = 0, nrows::Int = typemax(Int), structtype = nothing, schemafrom::UnitRange{Int} = 1:10, nworkers::Int = 1)
Create an index of a JSONLines file at path.
- Keyword Arguments:
  - filestart=0: Number of bytes to skip before reading the file
  - skip=0: Number of rows to skip before parsing
  - nrows=typemax(Int): Maximum number of rows to index
  - structtype=nothing: StructType passed to JSON3.read for each row
  - schemafrom=1:10: Rows to parse initially to determine columnnames and columntypes
  - nworkers=1: Number of threads to use for operations on the LineIndex
JSONLines.LineIterator — Method
LineIterator(path::String; filestart = 1, structtype = nothing)
Create an iterator of a JSONLines file at path.
- Keyword Arguments:
  - filestart=1: Row at which to start the iterator
  - structtype=nothing: StructType passed to JSON3.read for each row
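A minimal sketch of streaming over a file with LineIterator, assuming each iteration yields one parsed row that can be indexed with a Symbol (as the objects returned by JSON3.read can). The file path and the variable people are hypothetical.

```julia
import JSONLines

# Hypothetical sample file with one JSON object per line.
path = joinpath(mktempdir(), "people.jsonl")
open(path, "w") do io
    println(io, """{"name":"Daniel","organization":"IMSM"}""")
    println(io, """{"name":"Peter","organization":"StatMath"}""")
end

# Collect a single field while walking the file row by row,
# so the whole file never has to be parsed at once.
people = String[]
for row in JSONLines.LineIterator(path)
    push!(people, row[:name])  # assumption: rows support row[:key] access
end
```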
Base.filter — Method
filter(f::Function, lines::LineIndex)
Return rows of lines for which f evaluates to true.
Base.findall — Method
findall(f::Function, lines::LineIndex)
Return indices of lines for which f evaluates to true.
Base.findfirst — Method
findfirst(f::Function, lines::LineIndex)
Return index of first row for which f returns true.
Base.findlast — Method
findlast(f::Function, lines::LineIndex)
Return index of last row for which f returns true.
Base.findnext — Method
findnext(f::Function, lines::LineIndex, i::Int)
Return index of next row for which f returns true, starting at row i.
Base.findprev — Method
findprev(f::Function, lines::LineIndex, i::Int)
Return index of previous row for which f returns true, starting at row i.
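The search functions above might be used as in the following sketch. It assumes that f is called with one parsed row at a time and that rows can be indexed with a Symbol. The predicate is_imsm, the file path, and the variable names are hypothetical.

```julia
import JSONLines

# Hypothetical sample file with one JSON object per line.
path = joinpath(mktempdir(), "people.jsonl")
open(path, "w") do io
    println(io, """{"name":"Daniel","organization":"IMSM"}""")
    println(io, """{"name":"Peter","organization":"StatMath"}""")
end

lines = JSONLines.LineIndex(path; schemafrom = 1:2)

# Predicate applied to each row (assumption: rows support row[:key] access).
is_imsm(row) = row[:organization] == "IMSM"

all_idx   = findall(is_imsm, lines)      # indices of all matching rows
first_idx = findfirst(is_imsm, lines)    # index of the first match
next_idx  = findnext(is_imsm, lines, 1)  # first match at or after row 1
matching  = filter(is_imsm, lines)       # the matching rows themselves
```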
JSONLines.columnnames — Method
columnnames(lines::LineIndex)
Returns the columnnames of the LineIndex.
JSONLines.columntypes — Method
columntypes(lines::LineIndex)
Returns current value of columntypes of lines.
JSONLines.columnwise — Method
columnwise(lines::LineIndex; coltypes = lines.columntypes)
Parse lines to columnwise vectors. Similar to Tables.columntable.
JSONLines.gettypes — Function
gettypes(lines::LineIndex, rows = 1:5)
Infer types of columns in lines based on rows selected. Returns Dict of types.
JSONLines.materialize — Function
materialize(lines::LineIndex, f::Function, rows::Union{UnitRange{Int}, Vector{Int}} = 1:length(lines); eltype = T where T)
Apply f to rows selected. eltype of result can be specified as keyword argument.
JSONLines.materialize — Function
materialize(lines::LineIndex, rows::Union{UnitRange{Int}, Vector{Int}} = 1:length(lines))
Return a Vector{NamedTuple} of the rows selected. Similar to Tables.rowtable.
JSONLines.select — Method
select(path::String, cols...; nworkers = 1) => LineIndex
- path: Path to JSONLines file
- cols...: Columnnames to be selected
- Keyword Argument:
  - nworkers=1: Number of threads to use for operations on the resulting LineIndex
JSONLines.settype! — Method
settype!(lines::LineIndex, p::Union{Pair{Symbol, DataType},Pair{Symbol, UnionAll}, Pair{Symbol, Union}})
Set a single columntype using :col => Type.
JSONLines.settypes! — Function
settypes!(lines::LineIndex, rows::Union{UnitRange{Int}, Vector{Int}} = 1:5)
Infer types of columns in lines based on rows selected. Overwrites existing types.
JSONLines.settypes! — Method
settypes!(lines::LineIndex, d::Union{Dict{Symbol, DataType}, Dict{Symbol, UnionAll}, Dict{Symbol, Union}})
Manually set types for columns. d is a Dict in which keys are Symbols corresponding to columnnames and values are the datatypes.
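The type-related functions above could be combined roughly as follows. This is only a sketch built from the documented signatures. The sample file and variable names are hypothetical, and the row ranges are narrowed to 1:2 because the file has only two rows.

```julia
import JSONLines

# Hypothetical sample file with one JSON object per line.
path = joinpath(mktempdir(), "people.jsonl")
open(path, "w") do io
    println(io, """{"name":"Daniel","organization":"IMSM"}""")
    println(io, """{"name":"Peter","organization":"StatMath"}""")
end

lines = JSONLines.LineIndex(path; schemafrom = 1:2)

# Inspect the types inferred from the first two rows without changing lines.
inferred = JSONLines.gettypes(lines, 1:2)

# Set one column type with a Pair ...
JSONLines.settype!(lines, :name => String)

# ... or several at once with a Dict{Symbol, DataType}.
JSONLines.settypes!(lines, Dict(:name => String, :organization => String))

# Read back the current column names and types.
colnames = JSONLines.columnnames(lines)
coltypes = JSONLines.columntypes(lines)
```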
JSONLines.writelines — Method
writelines(path::String, rows; nworkers = 1, mode = "w")
Write rows to JSONLines file path.
- path: Path to output file
- rows: Tables.jl compatible data
- Keyword Arguments:
  - nworkers=1: Number of threads to use for parsing to JSONLines
  - mode="w": Mode the file is opened in. See I/O and Network
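A short sketch of writing Tables.jl compatible data with writelines. A Vector of NamedTuples is used here because it is a common Tables.jl row table. The output path and variable names are hypothetical.

```julia
import JSONLines

# A Vector of NamedTuples is a Tables.jl compatible row table.
rows = [
    (name = "Daniel", organization = "IMSM"),
    (name = "Peter", organization = "StatMath"),
]

# Hypothetical output path inside a temporary directory.
out = joinpath(mktempdir(), "out.jsonl")

# Write one JSON object per row, using the default mode "w".
JSONLines.writelines(out, rows)

# Read the raw text back with Base.readlines to inspect the result.
written = readlines(out)
```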
JSONLines.@MStructType — Macro
@MStructType name fieldnames...
This macro gives a convenient syntax for declaring mutable StructTypes for reading specific variables from a JSONLines file. Also defines row[:col] access for rows of the resulting type.
- name: Name of the StructType
- fieldnames...: Names of the variables to be read (must be the same as in the file)
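The macro might be used along the following lines. This is a sketch under stated assumptions: the type name Person is hypothetical, the macro is assumed to be callable at top level with a module-qualified name, and the resulting type is assumed to be what the structtype keyword expects.

```julia
import JSONLines

# Declare a mutable StructType that reads only the name and organization
# variables. Person is a hypothetical name chosen for this example.
JSONLines.@MStructType Person name organization

# Hypothetical sample file with one JSON object per line.
path = joinpath(mktempdir(), "people.jsonl")
open(path, "w") do io
    println(io, """{"name":"Daniel","organization":"IMSM"}""")
    println(io, """{"name":"Peter","organization":"StatMath"}""")
end

# Parse each row into a Person and use the row[:col] access the macro defines.
people = String[]
for row in JSONLines.LineIterator(path; structtype = Person)
    push!(people, row[:name])
end
```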