tiledbsoma.DataFrame.create

classmethod DataFrame.create(uri: str, *, schema: Schema, index_column_names: Sequence[str] = ('soma_joinid',), domain: Sequence[None | Tuple[Any, Any] | List[Any]] | None = None, platform_config: Dict[str, Mapping[str, Any]] | object | None = None, context: SOMATileDBContext | None = None, tiledb_timestamp: int | datetime | None = None) → DataFrame

Creates the DataFrame at the given URI, which may be on local disk, S3, or other cloud storage.

Parameters:
  • schema – Arrow schema defining the per-column schema. This schema must define all columns, including columns to be named as index columns. If the schema includes types unsupported by the SOMA implementation, an error will be raised.

  • index_column_names – A list of column names to use as user-defined index columns (e.g., ['cell_type', 'tissue_type']). All named columns must exist in the schema, and at least one index column name is required.

  • domain – An optional sequence of tuples specifying the domain of each index column. Each tuple must be a (minimum, maximum) pair giving the smallest and largest values storable in that index column. For example, with a single int64-valued index column, a domain of [(100, 200)] indicates that values from 100 through 200, inclusive, can be stored in that column. If provided, this sequence must have the same length as index_column_names, and each index-column domain will be as specified. If the domain is omitted entirely, or is None for a given dimension, the corresponding index column gets an empty range, and subsequent writes will fail with “A range was set outside of the current domain”. Although this parameter is optional, providing the desired domain at create time is strongly recommended unless you have a particular reason not to. See also change_domain, which allows you to expand the domain after creation.

  • platform_config – Platform-specific options used to create this array. This may be provided as settings in a dictionary, with options located in the {'tiledb': {'create': ...}} key, or as a TileDBCreateOptions object.

  • tiledb_timestamp – If specified, overrides the default timestamp used to open this object. If unset, uses the timestamp provided by the context.
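
As an illustration of the domain parameter described above, the following sketch derives one inclusive (min, max) pair per index column from the values you intend to write. The helper names domain_from_values and create_with_domain are hypothetical and not part of tiledbsoma. The sketch assumes numeric index columns; whether other column types accept a min/max pair is not covered by the text above.

```python
from typing import Any, Dict, List, Sequence, Tuple


def domain_from_values(
    index_column_names: Sequence[str],
    values: Dict[str, Sequence[Any]],
) -> List[Tuple[Any, Any]]:
    """Return one inclusive (min, max) pair per index column, in order.

    Hypothetical helper: the result has the same length as
    index_column_names, as the domain parameter requires.
    """
    domain = []
    for name in index_column_names:
        column = values[name]
        if len(column) == 0:
            raise ValueError(f"no values for index column {name!r}")
        domain.append((min(column), max(column)))
    return domain


def create_with_domain(uri, schema, index_column_names, values):
    """Create a DataFrame whose domain covers the given index values."""
    # Imported lazily so domain_from_values can be used without tiledbsoma.
    import tiledbsoma

    return tiledbsoma.DataFrame.create(
        uri,
        schema=schema,
        index_column_names=index_column_names,
        domain=domain_from_values(index_column_names, values),
    )


print(domain_from_values(["soma_joinid"], {"soma_joinid": [100, 150, 200]}))
# prints [(100, 200)]
```

Sizing the domain to the current data is only one option. If you expect to append rows later, you may prefer a wider range chosen up front.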

Returns:

The DataFrame.

Raises:
  • TypeError – If the schema parameter specifies an unsupported type, or if index_column_names specifies a non-indexable column.

  • ValueError – If index_column_names is malformed or specifies an undefined column name.

  • ValueError – If the schema specifies illegal column names.

  • tiledbsoma.AlreadyExistsError – If the underlying object already exists at the given URI.

  • tiledbsoma.NotCreateableError – If the URI is malformed for a particular storage backend.

  • TileDBError – If unable to create the underlying object.
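
One way to handle the AlreadyExistsError case listed above is a create-or-open pattern, sketched below. The function name create_or_open is hypothetical, and opening the existing object with mode="w" is an assumption about what the caller wants. Note that the existing object's schema is not checked against the schema passed in.

```python
def create_or_open(uri, schema, **create_kwargs):
    """Create a SOMA DataFrame at uri, or open the existing one for writing.

    Hypothetical helper, not part of tiledbsoma. Extra keyword arguments
    (for example index_column_names or domain) are passed to create().
    """
    # Imported lazily so that defining this helper does not require tiledbsoma.
    import tiledbsoma

    try:
        return tiledbsoma.DataFrame.create(uri, schema=schema, **create_kwargs)
    except tiledbsoma.AlreadyExistsError:
        # Something already exists at this URI, so fall back to opening it.
        return tiledbsoma.open(uri, mode="w")
```

The other documented exceptions (TypeError, ValueError, NotCreateableError, TileDBError) are deliberately left to propagate, since they indicate a problem with the arguments or the storage backend rather than a pre-existing object.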

Examples

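>>> import pandas as pd
>>> import pyarrow as pa
>>> import tiledbsoma
>>> df = pd.DataFrame(data={"soma_joinid": [0, 1], "col1": ["a", "b"]})
>>> # The domain covers soma_joinid values 0 through 1, inclusive.
>>> with tiledbsoma.DataFrame.create(
...     "a_dataframe", schema=pa.Schema.from_pandas(df), domain=[(0, 1)]
... ) as soma_df:
...     soma_df.write(pa.Table.from_pandas(df, preserve_index=False))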
...
>>> with tiledbsoma.open("a_dataframe") as soma_df:
...     a_df = soma_df.read().concat().to_pandas()
...
>>> a_df
   soma_joinid col1
0            0    a
1            1    b

Lifecycle

Maturing.