User documentation

RecSQL package

RecSQL is a simple module that provides a numpy.record array frontend to an underlying SQLite table.

The SQLarray object populates a SQL table from a numpy record array, a iterable that supplies table records, or a string that contains an especially simple reStructured text table. The SQL table is held in memory and functions are provided to run SQL queries and commands on the underlying database. Queries return record arrays if possible (although a flag can explicitly change this).

Query results are cached to improve performance. This can be disabled (which is recommened for large data sets).

The SQL table is named on initialization. Later one can refer to this table by the name or the magic name __self__ in SQL statements. Additional tables can be added to the same database (by using the connection keyword of the constructor)

The recsql.rest_table module uses the base functionality to parse a restructured text table from a string (such as a doc string) and returns a nicely structured table. This allows for use of parameters that are documented in the doc strings.

See also

PyTables is a high-performance interface to table data. In most cases you will probably better off in the long run using PyTables than recSQL.

Important functions and classes

A SQLarray can be constructed by either reading data from a CSV file or reST table with the SQLarray_fromfile() function or constructed directly from a numpy.recarray via the SQLarray constructor.

recsql.SQLarray_fromfile(filename, **kwargs)

Create a SQLarray from filename.

Uses the filename suffix to detect the contents:
rst, txt
restructure text (see recsql.rest_table
csv
comma-separated (see recsql.csv_table)
Arguments :
filename

name of the file that contains the data with the appropriate file extension

kwargs
class recsql.SQLarray(name=None, records=None, filename=None, columns=None, cachesize=5, connection=None, is_tmp=False, **kwargs)

A SQL table that returns (mostly) rec arrays.

SQLarray([name[, records[, columns[, cachesize=5, connection=None, dbfile=":memory:"]]]])
Arguments :
name

table name (can be referred to as ‘__self__’ in SQL queries)

records

numpy record array that describes the layout and initializes the table OR any iterable (and then columns must be set, too) OR a string that contains a single, simple reStructured text table (and the table name is set from the table name in the reST table.) If None then simply associate with existing table name.

filename

Alternatively to records, read a reStructured table from filename.

columns

sequence of column names (only used if records does not have attribute dtype.names) [None]

cachesize

number of (query, result) pairs that are cached [5]

connection

If not None, reuse this connection; this adds a new table to the same database, which allows more complicated queries with cross-joins. The table’s connection is available as the attribute T.connection. [None]

dbfile

Normally the db is held in memory (”:memory:”) but if a filename is provided then the underlying SQLite db is held on disk and can be accessed and restored (see SQLarray.save()). Only works with connection = None [”:memory:”]

is_tmp

True: create a tmp table; False: regular table in db [False]

Bugs :
  • InterfaceError: Error binding parameter 0 - probably unsupported type

    In this case the recarray contained types such as numpy.int64 that are not understood by sqlite. Either convert the data manually (by setting the numpy dtypes yourself on the recarray, or better: feed a simple list of tuples (“records”) to this class in records. Make sure that these tuples only contain standard python types. Together with records you will also have to supply the names of the data columns in the keyword argument columns.

    If you are reading from a file then it might be simpler to use recsql.sqlarray.SQLarray_fromfile().

SELECT(fields, *args, **kwargs)

Execute a simple SQL SELECT statement and returns values as new numpy rec array.

The arguments fields and the additional optional arguments are simply concatenated with additional SQL statements according to the template:

SELECT <fields> FROM __self__ [args]

The simplest fields argument is "*".

Example:

Create a recarray in which students with average grade less than 3 are listed:

result = T.SELECT("surname, subject, year, avg(grade) AS avg_grade",
               "WHERE avg_grade < 3", "GROUP BY surname,subject",
               "ORDER BY avg_grade,surname")

The resulting SQL would be:

SELECT surname, subject, year, avg(grade) AS avg_grade FROM __self__
     WHERE avg_grade < 3
     GROUP BY surname,subject
     ORDER BY avg_grade,surname

Note how one can use aggregate functions such avg().

The string ‘__self__’ is automatically replaced with the table name (T.name); this can be used for cartesian products such as

LEFT JOIN __self__ WHERE ...

Note

See the documentation for sql() for more details on the available keyword arguments and the use of ? parameter interpolation.

close()

Clean up (if no more connections to the db exist).

  • For in-memory: Delete the underlying SQL table from the in-memory database.
  • For on-disk: save and close connection
connection_count

Number of currently open connections to the database.

(Stored in table sqlarray_master.)

has_table(name)

Return True if the table name exists in the database.

limits(variable)

Return minimum and maximum of variable across all rows of data.

merge(recarray, columns=None)

Merge another recarray with the same columns into this table.

Arguments :
recarray

numpy record array that describes the layout and initializes the table

Returns :

n number of inserted rows

Raises :

Raises an exception if duplicate and incompatible data exist in the main table and the new one.

merge_table(name)

Merge an existing table in the database with the __self__ table.

Executes as 'INSERT INTO __self__ SELECT * FROM <name>'. However, this method is probably used less often than the simpler merge().

Arguments :name name of the table in the database (must be compatible with __self__)
Returns :n number of inserted rows
recarray

Return underlying SQL table as a read-only record array.

save()

Commit changes to file.

Only works if the SQLarray was created with they dbfile = FILENAME keyword. There is currently no way to save a in-memory db.

See also

aoft.DB.clone()

selection(SQL, parameters=None, **kwargs)

Return a new SQLarray from a SELECT selection.

This method is useful to build complicated selections and essentially new tables from existing data. The result of the SQL query is stored as a new table in the database. By default, a unique name is created but this can be overridden with the name keyword.

Arguments :
SQL

SQL SELECT query string. A leading SELECT * FROM __self__ WHERE can be omitted (see examples below). The SQL is scrubbed and only data up to the first semicolon is used (note that this means that there cannot even be a semicolon in quotes; if this is a problem, file a bug report and it might be changed).

Keywords :
name

name of the table, None autogenerates a name unique to this query. name may not refer to the parent table itself. [None]

parameters

tuple of values that are safely interpolated into subsequent ? characters in the SQL string

force

If True then an existing table of the same name is DROP``ped first. If ``False and the table already exists then SQL is ignored and a SQLarray of the existing table name is returned. [False]

Returns:a SQLarray referring to the table name in the database; it also inherits the SQLarray.dbfile

Examples:

s = selection('a > 3')
s = selection('a > ?', (3,))
s = selection('SELECT * FROM __self__ WHERE a > ? AND b < ?', (3, 10))
sql(SQL, parameters=None, asrecarray=True, cache=True)

Execute sql statement.

Arguments :
SQL : string

Full SQL command; can contain the ? place holder so that values supplied with the parameters keyword can be interpolated using the pysqlite interface.

parameters : tuple

Parameters for ? interpolation.

asrecarray : boolean

True: return a numpy.recarray if possible; False: return records as a list of tuples. [True]

cache : boolean

Should the results be cached? Set to False for large queries to avoid memory issues. Queries with ? place holders are never cached. [True]

Warning

There are no sanity checks applied to the SQL.

If possible, the returned list of tuples is turned into a numpy record array, otherwise the original list of tuples is returned.

Warning

Potential BUG: if there are memory issues then it can happen that we just silently fall back to a tuple even though calling code expects a recarray; because we swallowed ANY exception the caller will never know

The last cachesize queries are cached (for cache=True) and are returned directly unless the table has been modified.

Note

‘__self__’ is substituted with the table name. See the doc string of the SELECT() method for more details.

sql_index(index_name, column_names, unique=True)

Add a named index on given columns to improve performance.

sql_select(fields, *args, **kwargs)

Execute a simple SQL SELECT statement and returns values as new numpy rec array.

The arguments fields and the additional optional arguments are simply concatenated with additional SQL statements according to the template:

SELECT <fields> FROM __self__ [args]

The simplest fields argument is "*".

Example:

Create a recarray in which students with average grade less than 3 are listed:

result = T.SELECT("surname, subject, year, avg(grade) AS avg_grade",
               "WHERE avg_grade < 3", "GROUP BY surname,subject",
               "ORDER BY avg_grade,surname")

The resulting SQL would be:

SELECT surname, subject, year, avg(grade) AS avg_grade FROM __self__
     WHERE avg_grade < 3
     GROUP BY surname,subject
     ORDER BY avg_grade,surname

Note how one can use aggregate functions such avg().

The string ‘__self__’ is automatically replaced with the table name (T.name); this can be used for cartesian products such as

LEFT JOIN __self__ WHERE ...

Note

See the documentation for sql() for more details on the available keyword arguments and the use of ? parameter interpolation.

Example

>>> from recsql import SQLarray
>>> import numpy
>>> a = numpy.rec.fromrecords(numpy.arange(100).reshape(25,4), names='a,b,c,d')
>>> Q = SQLarray('my_name', a)
>>> print repr(Q.recarray)
rec.array([(0, 1, 2, 3), (4, 5, 6, 7), (8, 9, 10, 11), (12, 13, 14, 15),
       (16, 17, 18, 19), (20, 21, 22, 23), (24, 25, 26, 27),
       (28, 29, 30, 31), (32, 33, 34, 35), (36, 37, 38, 39),
       (40, 41, 42, 43), (44, 45, 46, 47), (48, 49, 50, 51),
       (52, 53, 54, 55), (56, 57, 58, 59), (60, 61, 62, 63),
       (64, 65, 66, 67), (68, 69, 70, 71), (72, 73, 74, 75),
       (76, 77, 78, 79), (80, 81, 82, 83), (84, 85, 86, 87),
       (88, 89, 90, 91), (92, 93, 94, 95), (96, 97, 98, 99)],
      dtype=[('a', '<i4'), ('b', '<i4'), ('c', '<i4'), ('d', '<i4')])
>>> Q.SELECT('*', 'WHERE a < 10 AND b > 5')
rec.array([(8, 9, 10, 11)],
    dtype=[('a', '<i4'), ('b', '<i4'), ('c', '<i4'), ('d', '<i4')])
# creating new SQLarrays:
>>> R = Q.selection('a < 20 AND b > 5')
>>> print R
<recsql.sqlarray.SQLarray object at 0x...>

Additional SQL functions

Note that the SQL database that is used as the backend for SQLarray has a few additional functions defined in addition to the standard SQL available in sqlite. These can be used in SELECT statements and often avoid post-processing of record arrays in python. It is relatively straightforward to add new functions (see the source code and in particular the recsql.sqlarray.SQLarray._init_sql_functions() method; the functions themselves are defined in the module recsql.sqlfunctions).

Simple SQL functions

Simple functions transform a single input value into a single output value:

Expression SQL equivalent
y = f(x) SELECT f(x) AS y

Additional simple functions have been defined:

Simple SQL f() description
sqr(x) square x*x
sqrt(x) square root numpy.sqrt()
pow(x,y) power x**y
periodic(x) wrap angle in degree between -180º and +180º
regexp(pattern,string) string REGEXP pattern
match(pattern,string) string MATCH pattern (anchored REGEXP)
fformat(format,x) string formatting of a single value format % x

Aggregate SQL functions

Aggregate functions combine data from a query; they are typically used with a ‘GROUP BY col’ clause. They can be thought of as numpy ufuncs:

Expression SQL equivalent
y = f(x1,x2,...xN) SELECT f(x) AS y ... GROUP BY x

For completeness, the table also lists sqlite built-in aggregate functions:

Simple aggregate f() description
avg(x) mean [sqlite builtin]
std(x) standard deviation (using N-1 variance)
stdN(x) standard deviation (using N variance), sqrt(<(X - <X>)**2>)
median(x) median of the data (see numpy.median())
min(x) minimum [sqlite builtin]
max(x) maximum [sqlite builtin]

PyAggregate SQL functions

PyAggregate functions act on a list of data points in the same way as ordinary aggregate functions but they return python objects such as numpy arrays, or tuples of numpy arrays (eg bin edges and histogram). In order to make this work, specific types have to be declared when returning the results:

For instance, the histogram() function returns a python Object, the tuple (histogram, edges):

a.sql('SELECT histogram(x) AS "x [Object]" FROM __self__', asrecarray=False)

The return type (‘Object’) needs to be declared with the 'AS "x [Object]"' syntax (note the quotes). (See more details in the sqlite documentation under adapters and converters.) The following table lists all PyAggregate functions that have been defined:

PyAggregate type signature; description
array NumpyArray array(x); a standard numpy.array()
histogram Object histogram(x,nbins,xmin,xmax); histogram x in nbins evenly spaced bins between xmin and xmax
distribution Object distribution(x,nbins,xmin,xmax); normalized histogram whose integral gives 1
meanhistogram Object meanhistogram(x,y,nbins,xmin,xmax); histogram data points y along x and average all y in each bin
stdhistogram Object stdhistogram(x,y,nbins,xmin,xmax); give the standard deviation (from N-1 variance) std(y) = sqrt(Var(y)) with Var(y) = <(y-<y>)^2>
medianhistogram Object medianhistogram((x,y,nbins,xmin,xmax); median(y)
minhistogram Object minhistogram((x,y,nbins,xmin,xmax); min(y)
maxhistogram Object maxhistogram((x,y,nbins,xmin,xmax); max(y)
zscorehistogram Object zscorehistogram((x,y,nbins,xmin,xmax); <abs(y-<y>)>/std(y)

Examples of using types in tables

The following show how to use the special types.

Declare types as ‘NumpyArray’:

a.sql("CREATE TABLE __self__(a NumpyArray)")

Then you can simply insert python objects (type(my_array) == numpy.ndarray):

a.sql("INSERT INTO __self__(a) values (?)", (my_array,))

When returning results of declared columns one does not have to do anything

(my_array,) = a.sql("SELECT a FROM __self__")

although one can also do

(my_array,) = q.sql('SELECT a AS "a [NumpyArray]" FROM __self__')

but when using a PyAggregate the type must be declared:

a.sql('SELECT histogram(x,10,0.0,1.5) as "hist [Object]" FROM __self__')

See also

Other approaches to solving the same problem: esutil.sqlite and hydroclimpy.io.sqlite