JSONLines

A simple package to read (parts of) a JSON Lines file. The main purpose is to read files that are larger than memory. The two main functions are LineIndex and LineIterator, which return an index of the rows in the given file and an iterator over the file, respectively.

The LineIndex is Tables.jl compatible and can be piped directly into e.g. a DataFrame if every row in the result has the same schema (i.e. the same variables). See also materialize and columnwise. It allows memory-efficient loading of rows of a JSON Lines file: skip and nrows select nrows rows after skipping the first skip rows. The file is memory-mapped (mmap) and only the required rows are loaded into RAM.

Each line of the file must contain a valid JSON object (denoted by {"String1":ELEMENT1, "String2":ELEMENT2, ...}). JSON parsing is done using the JSON3.jl package. Lines can be separated by \n or \r\n, and some whitespace characters (basically all that can be represented as a single UInt8) are allowed at the beginning of a line before the JSON object and before the newline character. Typically a file would look like this:

{"name":"Daniel","organization":"IMSM"}
{"name":"Peter","organization":"StatMath"}
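
The idea of memory-mapping a file and selecting rows with skip and nrows can be sketched with the Julia standard library alone. This is only an illustration under assumptions, not the package's implementation: lineranges is a hypothetical helper, and the sample file is written to a temporary directory.

```julia
using Mmap

# Hypothetical helper (not part of JSONLines): return the byte range of each
# selected line in a memory-mapped buffer. The first `skip` lines are ignored
# and at most `nrows` ranges are returned. Empty lines are counted but dropped.
function lineranges(buf::Vector{UInt8}; skip::Int = 0, nrows::Int = typemax(Int))
    ranges = UnitRange{Int}[]
    start = 1
    seen = 0
    n = length(buf)
    while start <= n && length(ranges) < nrows
        stop = findnext(==(UInt8('\n')), buf, start)
        lineend = stop === nothing ? n : stop - 1
        # Drop a trailing '\r' so that "\r\n" line endings are handled too
        if lineend >= start && buf[lineend] == UInt8('\r')
            lineend -= 1
        end
        seen += 1
        if seen > skip && lineend >= start
            push!(ranges, start:lineend)
        end
        start = stop === nothing ? n + 1 : stop + 1
    end
    return ranges
end

# Write the two example rows to a temporary file
path = joinpath(mktempdir(), "example.jsonl")
write(path, """
{"name":"Daniel","organization":"IMSM"}
{"name":"Peter","organization":"StatMath"}
""")

buf = Mmap.mmap(path)                          # the file as a Vector{UInt8}
ranges = lineranges(buf; skip = 1, nrows = 1)  # index only the second row
rows = [String(buf[r]) for r in ranges]        # only these bytes are copied
```

Only the byte ranges are kept while scanning, so the text of a row is copied into memory when that row is actually requested.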

There is experimental support for files with a JSON array on each line, where the first line after the skipped rows contains the names of the columns:

["name", "organization"]
["Daniel", "IMSM"]
["Peter", "StatMath"]

This should work but is not tested thoroughly.

Getting Started

(@v1.5) pkg> add JSONLines

Functions

JSONLines.LineIndex - Method
LineIndex(path::String; filestart::Int = 0, skip::Int = 0, nrows::Int = typemax(Int), structtype = nothing, schemafrom::UnitRange{Int} = 1:10, nworkers::Int = 1)

Create an index of a JSONLines file at path

  • Keyword Arguments:
    • filestart=0: Number of bytes to skip before reading the file
    • skip=0: Number of rows to skip before parsing
    • nrows=typemax(Int): Maximum number of rows to index
    • structtype=nothing: StructType passed to JSON3.read for each row
    • schemafrom=1:10: Rows to parse initially to determine columnnames and columntypes
    • nworkers=1: Number of threads to use for operations on the LineIndex
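
A minimal usage sketch based on the signature above. The sample file, its id and organization fields, and the row counts are made up for illustration; the file has 20 rows so that the default schemafrom=1:10 stays within the indexed rows.

```julia
using JSONLines

# Write a made-up sample file with 20 rows to a temporary directory
path = joinpath(mktempdir(), "people.jsonl")
open(path, "w") do io
    for i in 1:20
        org = isodd(i) ? "IMSM" : "StatMath"
        println(io, """{"id":$i,"organization":"$org"}""")
    end
end

# Skip the first 5 rows, then index at most 12 rows
lines = LineIndex(path; skip = 5, nrows = 12)

# Parse the indexed rows into a Vector of NamedTuples
rows = materialize(lines)
```

Because the LineIndex is Tables.jl compatible, DataFrame(lines) should work in the same way if DataFrames.jl is installed and every row has the same schema.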

JSONLines.LineIterator - Method
LineIterator(path::String; filestart = 1, structtype = nothing)

Create an iterator of a JSONLines file at path.

  • Keyword Arguments:
    • filestart=1: Row at which to start the iterator
    • structtype=nothing: StructType passed to JSON3.read for each row
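
A short sketch of iterating over a file row by row. It assumes that, with structtype=nothing, each row supports property access such as row.organization (as objects returned by JSON3.read do); the sample file and its fields are made up.

```julia
using JSONLines

# Write a made-up sample file with 20 rows to a temporary directory
path = joinpath(mktempdir(), "people.jsonl")
open(path, "w") do io
    for i in 1:20
        org = isodd(i) ? "IMSM" : "StatMath"
        println(io, """{"id":$i,"organization":"$org"}""")
    end
end

# Collect one field from every row without indexing the file first
orgs = String[]
for row in LineIterator(path)
    # Assumption: rows expose their keys as properties
    push!(orgs, String(row.organization))
end
```

Only one row is parsed at a time, which suits a single pass over a file that does not fit into memory.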

Base.filter - Method
filter(f::Function, lines::LineIndex)

Return rows of lines for which f evaluates to true

Base.findall - Method
findall(f::Function, lines::LineIndex)

Return indices of lines for which f evaluates to true

Base.findfirst - Method
findfirst(f::Function, lines::LineIndex)

Return index of first row for which f returns true

Base.findlast - Method
findlast(f::Function, lines::LineIndex)

Return index of last row for which f returns true

Base.findnext - Method
findnext(f::Function, lines::LineIndex, i::Int)

Return index of next row for which f returns true starting at row i

Base.findprev - Method
findprev(f::Function, lines::LineIndex, i::Int)

Return index of previous row for which f returns true starting at row i
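
The filter and find methods can be combined as in the following sketch. The predicate isimsm and the sample file are made up, and the code assumes that the row passed to f supports property access (row.organization).

```julia
using JSONLines

# Write a made-up sample file with 20 rows to a temporary directory
path = joinpath(mktempdir(), "people.jsonl")
open(path, "w") do io
    for i in 1:20
        org = isodd(i) ? "IMSM" : "StatMath"
        println(io, """{"id":$i,"organization":"$org"}""")
    end
end

lines = LineIndex(path)

# Hypothetical predicate: true for rows whose organization is "IMSM"
isimsm(row) = row.organization == "IMSM"

idx = findall(isimsm, lines)                        # indices of all matching rows
first_hit = findfirst(isimsm, lines)                # index of the first match
next_hit = findnext(isimsm, lines, first_hit + 1)   # next match after the first
matches = filter(isimsm, lines)                     # the matching rows themselves
```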

JSONLines.columnwise - Method
columnwise(lines::LineIndex; coltypes = lines.columntypes)

Parse lines to columnwise vectors. Similar to Tables.columntable

JSONLines.gettypes - Function
gettypes(lines::LineIndex, rows = 1:5)

Infer types of columns in lines based on rows selected. Returns Dict of types.

JSONLines.materialize - Function
materialize(lines::LineIndex, rows::Union{UnitRange{Int}, Vector{Int}} = 1:length(lines))

Return a Vector{NamedTuple} of the rows selected. Similar to Tables.rowtable

JSONLines.materialize - Function
materialize(lines::LineIndex, f::Function, rows::Union{UnitRange{Int}, Vector{Int}} = 1:length(lines); eltype = T where T)

Apply f to rows selected. eltype of result can be specified as keyword argument.

JSONLines.readcols - Method
readcols(path::String, cols...; nworkers = 1) => LineIndex
  • path: Path to JSONLines file
  • cols...: Columnnames to be selected
  • Keyword Argument:
    • nworkers=1: Number of threads to use for operations on the resulting LineIndex

JSONLines.settype! - Method
settype!(lines::LineIndex, p::Union{Pair{Symbol, DataType},Pair{Symbol, UnionAll}, Pair{Symbol, Union}})

Set a single columntype using :col => Type.

JSONLines.settypes! - Function
settypes!(lines::LineIndex, rows::Union{UnitRange{Int}, Vector{Int}} = 1:5)

Infer types of columns in lines based on rows selected. Overwrites existing types.

JSONLines.settypes! - Method
settypes!(lines::LineIndex, d::Union{Dict{Symbol, DataType}, Dict{Symbol, UnionAll}, Dict{Symbol, Union}})

Manually set types for columns. d is a Dict in which keys are Symbols corresponding to columnnames and values are the datatypes.
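
A sketch of setting column types by hand before parsing, using settype! for a single column and settypes! with a Dict for several. The sample file and the chosen types are assumptions made for illustration.

```julia
using JSONLines

# Write a made-up sample file with 20 rows to a temporary directory
path = joinpath(mktempdir(), "people.jsonl")
open(path, "w") do io
    for i in 1:20
        org = isodd(i) ? "IMSM" : "StatMath"
        println(io, """{"id":$i,"organization":"$org"}""")
    end
end

lines = LineIndex(path)

# Set the type of a single column with :col => Type
settype!(lines, :id => Int)

# Or set several column types at once with a Dict{Symbol, DataType}
settypes!(lines, Dict(:id => Int, :organization => String))

# Parse all rows into columnwise vectors using the types set above
cols = columnwise(lines)
```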

JSONLines.writelines - Method
writelines(path::String, rows; nworkers = 1, mode = "w")

Write rows to JSONLines file path

  • path: Path to output file
  • rows: Tables.jl compatible data
  • Keyword Arguments:
    • nworkers=1: Number of threads to use for parsing to JSONLines
    • mode="w": Mode the file is opened in. See the I/O and Network section of the Julia documentation for the available modes
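
A minimal sketch of writing rows to a file. A Vector of NamedTuples is used as the Tables.jl compatible input, and the output path in a temporary directory is made up for the example.

```julia
using JSONLines

# A Vector of NamedTuples is a Tables.jl compatible row table
rows = [
    (name = "Daniel", organization = "IMSM"),
    (name = "Peter", organization = "StatMath"),
]

path = joinpath(mktempdir(), "out.jsonl")
writelines(path, rows)      # mode = "w" creates or overwrites the file

# Read the file back as plain text to inspect the result
written = readlines(path)
```

Passing mode = "a" instead should append to an existing file, following the modes accepted by Base.open.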

JSONLines.@MStructType - Macro

@MStructType name fieldnames...

This macro gives a convenient syntax for declaring mutable StructTypes for reading specific variables from a JSONLines file. Also defines row[:col] access for rows of the resulting type.

  • name: Name of the StructType
  • fieldnames...: Names of the variables to be read (must be the same as in the file)