tiledbsoma.DataFrame

class tiledbsoma.DataFrame(handle: _WrapperType_co | DataFrameWrapper | DenseNDArrayWrapper | SparseNDArrayWrapper, *, _dont_call_this_use_create_or_open_instead: str = 'unset')

DataFrame is a multi-column table with a user-defined schema. The schema is expressed as an Arrow Schema, and defines the column names and value types.

Every DataFrame must contain a column called soma_joinid, of type int64, with negative values explicitly disallowed. The soma_joinid column contains a unique value for each row in the dataframe, and in some cases (e.g., as part of an Experiment), acts as a join key for other objects, such as SparseNDArray.
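The soma_joinid rules above (integer values that fit in int64, no negatives, no repeats) can be sketched as a pre-write check in plain Python. The helper name check_soma_joinids is hypothetical and is not part of tiledbsoma; it only illustrates the constraints described here.

```python
INT64_MAX = 2**63 - 1  # largest value an int64 column can hold


def check_soma_joinids(joinids):
    """Return the number of IDs, or raise ValueError if a rule is broken.

    Hypothetical helper for illustration; not a tiledbsoma function.
    """
    seen = set()
    for value in joinids:
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            raise ValueError(f"soma_joinid must be an integer, got {value!r}")
        if value < 0:
            raise ValueError(f"soma_joinid must be non-negative, got {value}")
        if value > INT64_MAX:
            raise ValueError(f"soma_joinid does not fit in int64: {value}")
        if value in seen:
            raise ValueError(f"duplicate soma_joinid: {value}")
        seen.add(value)
    return len(seen)


# Same IDs as the examples on this page.
print(check_soma_joinids([0, 1, 2]))
```

Running a check like this before building the Arrow table reports problems early, with a message naming the offending value.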

Lifecycle

Maturing.

Examples

>>> import pyarrow as pa
>>> import tiledbsoma
>>> schema = pa.schema(
...     [
...         ("soma_joinid", pa.int64()),
...         ("A", pa.float32()),
...         ("B", pa.large_string()),
...     ]
... )
>>> with tiledbsoma.DataFrame.create("./test_dataframe", schema=schema) as df:
...     data = pa.Table.from_pydict(
...         {
...             "soma_joinid": [0, 1, 2],
...             "A": [1.0, 2.7182, 3.1214],
...             "B": ["one", "e", "pi"],
...         }
...     )
...     df.write(data)
>>> with tiledbsoma.DataFrame.open("./test_dataframe") as df:
...     print(df.schema)
...     print("---")
...     print(df.read().concat().to_pandas())
...
soma_joinid: int64
A: float
B: large_string
---
   soma_joinid       A    B
0            0  1.0000  one
1            1  2.7182    e
2            2  3.1214   pi
>>> import pyarrow as pa
>>> import tiledbsoma
>>> schema = pa.schema(
...     [
...         ("soma_joinid", pa.int64()),
...         ("A", pa.float32()),
...         ("B", pa.large_string()),
...     ]
... )
>>> with tiledbsoma.DataFrame.create(
...     "./test_dataframe_2",
...     schema=schema,
...     index_column_names=["A", "B"],
...     domain=[(0.0, 10.0), None],
... ) as df:
...     data = pa.Table.from_pydict(
...         {
...             "soma_joinid": [0, 1, 2],
...             "A": [1.0, 2.7182, 3.1214],
...             "B": ["one", "e", "pi"],
...         }
...     )
...     df.write(data)
>>> with tiledbsoma.DataFrame.open("./test_dataframe_2") as df:
...     print(df.schema)
...     print("---")
...     print(df.read().concat().to_pandas())
A: float
B: large_string
soma_joinid: int64
---
        A    B  soma_joinid
0  1.0000  one            0
1  2.7182    e            1
2  3.1214   pi            2

In the second example, the index column names are specified explicitly. The domain is optional: if it is omitted, each index column defaults to the largest domain its datatype allows. If a domain is specified, it must be a tuple or list with the same length as index_column_names. Any slot may be None, which means the largest possible domain is used for that column. For string and bytes index columns, the slot must be None.
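The domain rules in the preceding paragraph can be sketched as a small validator in plain Python. Everything in it is an assumption made for illustration: normalize_domain is a hypothetical helper, column types are passed as plain strings, and the set of names treated as string/bytes types is a guess rather than a list taken from tiledbsoma. The requirement that a non-None slot be a (min, max) pair is inferred from the example, not stated by the library.

```python
# Type names treated as string/bytes types; this set is an assumption.
STRING_LIKE = {"string", "large_string", "binary", "large_binary"}


def normalize_domain(index_column_names, column_types, domain=None):
    """Check a domain argument against the rules described above.

    Hypothetical helper, not part of tiledbsoma. Returns one slot per
    index column, where None stands for "largest possible domain".
    """
    n = len(index_column_names)
    if domain is None:
        # Domain omitted: every index column gets the default.
        return [None] * n
    if len(domain) != n:
        raise ValueError(
            f"domain has {len(domain)} slots but there are {n} index columns"
        )
    result = []
    for name, slot in zip(index_column_names, domain):
        if slot is None:
            result.append(None)
            continue
        if column_types[name] in STRING_LIKE:
            raise ValueError(f"domain for string/bytes column {name!r} must be None")
        if len(slot) != 2:
            raise ValueError(f"domain for {name!r} must be a (min, max) pair")
        result.append(tuple(slot))
    return result


types = {"A": "float32", "B": "large_string"}
print(normalize_domain(["A", "B"], types, [(0.0, 10.0), None]))
```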

__init__(handle: _WrapperType_co | DataFrameWrapper | DenseNDArrayWrapper | SparseNDArrayWrapper, *, _dont_call_this_use_create_or_open_instead: str = 'unset')

Internal-only common initializer steps.

This function is internal; users should open TileDB SOMA objects using the create() and open() factory class methods.

Methods

__init__(handle, *[, ...])

Internal-only common initializer steps.

exists(uri[, context, tiledb_timestamp])

Finds whether an object of this type exists at the given URI.

create(uri, *, schema[, index_column_names, ...])

Creates the data structure on disk/S3/cloud.

open(uri[, mode, tiledb_timestamp, context, ...])

Opens this specific type of SOMA object.

reopen(mode[, tiledb_timestamp])

Returns a new copy of the SOMAObject with the given mode at the current Unix timestamp.

close()

Release any resources held while the object is open.

read([coords, column_names, result_order, ...])

Reads a user-defined subset of data, addressed by the dataframe indexing columns and optionally filtered, and returns the results as one or more Arrow tables.

write(values[, platform_config])

Writes an Arrow table to the persistent object.

verify_open_for_writing()

Raises an error if the object is not open for writing.

keys()

Returns the names of the columns when read back as a dataframe.

tiledbsoma_upgrade_domain(newdomain[, ...])

Allows you to set the domain of a SOMA DataFrame, when the DataFrame does not have a domain set yet.

change_domain(newdomain[, check_only])

Allows you to enlarge the domain of a SOMA DataFrame, when the DataFrame already has a domain.

tiledbsoma_resize_soma_joinid_shape(newshape)

Increases the shape of the dataframe on the soma_joinid index column, if it indeed is an index column, leaving all other index columns as-is.

tiledbsoma_upgrade_soma_joinid_shape(newshape)

This is like tiledbsoma_upgrade_domain, but it applies the specified domain update only to the soma_joinid index column.

non_empty_domain()

Retrieves the non-empty domain for each dimension, namely the smallest and largest indices in each dimension for which the dataframe contains data.

config_options_from_schema()

Returns metadata about the array that is not encompassed within the Arrow Schema, in the form of a PlatformConfig (deprecated).
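To make the non_empty_domain() description concrete, the sketch below computes the smallest and largest value per index column from in-memory rows. It is a plain-Python model of the idea, not the library's implementation; the function name sketch_non_empty_domain is hypothetical, and returning None for an empty input is a choice of this sketch rather than documented tiledbsoma behavior.

```python
def sketch_non_empty_domain(rows, index_column_names):
    """Return one (min, max) pair per index column for the given rows.

    rows is a list of dicts mapping column name to value.
    Hypothetical model for illustration; not a tiledbsoma function.
    """
    if not rows:
        # Assumption of this sketch: no data means no non-empty domain.
        return None
    return tuple(
        (min(row[name] for row in rows), max(row[name] for row in rows))
        for name in index_column_names
    )


# Rows matching the second example above, indexed on "A" and "B".
rows = [
    {"soma_joinid": 0, "A": 1.0, "B": "one"},
    {"soma_joinid": 1, "A": 2.7182, "B": "e"},
    {"soma_joinid": 2, "A": 3.1214, "B": "pi"},
]
print(sketch_non_empty_domain(rows, ["A", "B"]))
```

String columns compare lexicographically in this model, so the pair for "B" is the first and last value in sorted order rather than anything related to row order.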

Attributes

uri

Accessor for the object's storage URI.

soma_type

A string describing the SOMA type of this object.

schema

Returns the data schema, in the form of an Arrow Schema.

index_column_names

Returns index (dimension) column names.

count

Returns the number of rows in the dataframe.

domain

Returns tuples of minimum and maximum values, one tuple per index column, currently storable on each index column of the dataframe.

maxdomain

Returns tuples of minimum and maximum values, one tuple per index column, to which the dataframe can have its domain resized.

tiledbsoma_has_upgraded_domain

Returns True if the array has the upgraded, resizable domain feature from TileDB-SOMA 1.15: either the array was created with this support, or tiledbsoma_upgrade_domain has been applied to it.

mode

The mode this object was opened in, either r or w.

closed

True if the object has been closed.

context

A value storing implementation-specific configuration information.

tiledb_timestamp

The time that this object was opened in UTC.

tiledb_timestamp_ms

The time this object was opened, in milliseconds since the Unix epoch.

metadata

The metadata of this SOMA object.
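The two timestamp attributes describe the same instant in different units. As a minimal sketch of that relationship, using only the Python standard library, the helpers below convert between milliseconds since the Unix epoch and a UTC datetime. The names ms_to_utc and utc_to_ms are hypothetical and are not part of tiledbsoma.

```python
from datetime import datetime, timedelta, timezone

UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def ms_to_utc(timestamp_ms):
    """Convert milliseconds since the Unix epoch to a timezone-aware UTC datetime."""
    # timedelta arithmetic avoids the rounding that float seconds can introduce.
    return UNIX_EPOCH + timedelta(milliseconds=timestamp_ms)


def utc_to_ms(moment):
    """Convert a timezone-aware datetime to whole milliseconds since the Unix epoch."""
    # Dividing one timedelta by another with // yields an int.
    return (moment - UNIX_EPOCH) // timedelta(milliseconds=1)


print(ms_to_utc(1_700_000_000_000).isoformat())
```

utc_to_ms expects a timezone-aware datetime; passing a naive one raises a TypeError when it is subtracted from the aware epoch.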