pittgoogle.bigquery

Classes to facilitate connections to BigQuery datasets and tables.

Client([auth, client])

A client for interacting with Google BigQuery.

Table(name, dataset[, client, projectid])

Methods and properties for interacting with a Google BigQuery table.


class pittgoogle.bigquery.Client(auth: Auth = None, client: Client | None = None)[source]

A client for interacting with Google BigQuery.


property auth: Auth

Credentials for the Google Cloud project that this client will be connected to.

This will be created using environment variables if necessary.

property client: Client

Google Cloud BigQuery client.

If the client has not been initialized yet, it will be created using Client.auth.

Returns:

An instance of the Google Cloud BigQuery client.

Return type:

google.cloud.bigquery.Client
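The lazy initialization described above follows a common cached-property pattern. A minimal sketch (illustrative only, not pittgoogle's actual implementation):

```python
# Illustrative sketch of lazy client creation, as the `client` property
# describes. The real property builds a google.cloud.bigquery.Client
# from Client.auth; object() stands in for that here.
class LazyClientHolder:
    def __init__(self):
        self._client = None

    @property
    def client(self):
        # Created on first access, then cached and reused.
        if self._client is None:
            self._client = object()  # stand-in for bigquery.Client(...)
        return self._client
```

Repeated accesses return the same underlying client instance, so the (potentially expensive) construction happens at most once.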

list_table_names(dataset: str, projectid: str | None = None) → list[str][source]

Get the names of the tables in the dataset.

Parameters:
  • dataset (str) – The name of the dataset.

  • projectid (str, optional) – The dataset owner’s Google Cloud project ID. If None, Client.client.project will be used.

Returns:

A list of table names in the dataset.

Return type:

list[str]

Example

import pittgoogle

bqclient = pittgoogle.bigquery.Client()
bqclient.list_table_names(dataset="ztf", projectid=pittgoogle.ProjectIds().pittgoogle)

query(query: str, to_dataframe: bool = True, to_dataframe_kwargs: dict | None = None, **job_config_kwargs)[source]

Submit a BigQuery query job.

Parameters:
  • query (str) – The SQL query to execute.

  • to_dataframe (bool, optional) – Whether to fetch the results and return them as a pandas DataFrame (True, default) or just return the query job (False).

  • to_dataframe_kwargs (dict, optional) – Keyword arguments passed to google.cloud.bigquery.QueryJob.to_dataframe; ignored unless to_dataframe is True. Notable options: dtypes (dict), max_results (int), create_bqstorage_client (bool). create_bqstorage_client controls whether results are fetched using google.cloud.bigquery_storage (True) or google.cloud.bigquery (False); bigquery_storage can be faster but is not required. If you do not specify this option, pittgoogle sets it to True when the bigquery_storage library is installed, else False.

  • **job_config_kwargs – Keyword arguments to pass to the google.cloud.bigquery.QueryJobConfig constructor. Notable option: dry_run (bool).

Returns:

pandas.DataFrame if to_dataframe is True, else google.cloud.bigquery.QueryJob

Example

Query two tables (ztf.alerts_v4_02 and ztf.alerts_v3_3) for data on one object (ZTF19acfixfe).

import pittgoogle

bqclient = pittgoogle.bigquery.Client()
pittgoogle_project = pittgoogle.ProjectIds().pittgoogle

sql = f"""
    SELECT objectId, candid, candidate.jd, candidate.fid, candidate.magpsf
    FROM `{pittgoogle_project}.ztf.alerts_v3_3`
    WHERE objectId = 'ZTF19acfixfe'
    UNION ALL
    SELECT objectId, candid, candidate.jd, candidate.fid, candidate.magpsf
    FROM `{pittgoogle_project}.ztf.alerts_v4_02`
    WHERE objectId = 'ZTF19acfixfe'
"""

diaobject_df = bqclient.query(query=sql)

class pittgoogle.bigquery.Table(name: str, dataset: str, client: Client | None = NOTHING, projectid: str = None)[source]

Methods and properties for interacting with a Google BigQuery table.

Parameters:
  • name (str) – Name of the BigQuery table.

  • dataset (str) – Name of the BigQuery dataset this table belongs to.

  • projectid (str, optional) – The table owner’s Google Cloud project ID. If not provided, the client’s project ID will be used.

  • client (google.cloud.bigquery.Client, optional) – BigQuery client that will be used to access the table. If not provided, a new client will be created the first time it is requested.


client: Client | None

BigQuery client used to access the table.

dataset: str

Name of the BigQuery dataset this table belongs to.

classmethod from_cloud(name: str, *, dataset: str | None = None, survey: str | None = None, testid: str | None = None)[source]

Create a Table object using a BigQuery client with implicit credentials.

Use this method when creating a Table object in code running in Google Cloud (e.g., in a Cloud Run module). The underlying Google APIs will automatically find your credentials.

The table resource in Google BigQuery is expected to already exist.

Parameters:
  • name (str) – Name of the table.

  • dataset (str, optional) – Name of the dataset containing the table. Either this or a survey is required. If a testid is provided, it will be appended to this name following the Pitt-Google naming syntax.

  • survey (str, optional) – Name of the survey. This will be used as the name of the dataset if the dataset kwarg is not provided. This kwarg is provided for convenience in cases where the Pitt-Google naming syntax is used to name resources.

  • testid (str, optional) – Pipeline identifier. If this is not None, False, or “False”, it will be appended to the dataset name. This is used in cases where the Pitt-Google naming syntax is used to name resources. This allows pipeline modules to find the correct resources without interfering with other pipelines that may have deployed resources with the same base names (e.g., for development and testing purposes).

Returns:

The Table object.

Return type:

Table
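The dataset-name resolution described by the dataset, survey, and testid parameters can be sketched as follows. This is a hedged reconstruction from the parameter descriptions: resolve_dataset is a hypothetical helper (not part of the pittgoogle API), and the separator used when appending testid is an assumption.

```python
def resolve_dataset(dataset=None, survey=None, testid=None):
    """Sketch of how from_cloud picks a dataset name (assumed behavior)."""
    # Either `dataset` or `survey` is required; `dataset` wins if both given.
    name = dataset if dataset is not None else survey
    if name is None:
        raise ValueError("either `dataset` or `survey` is required")
    # testid is appended unless it is None, False, or "False".
    if testid not in (None, False, "False"):
        name = f"{name}-{testid}"  # the separator is an assumption
    return name
```

For example, resolve_dataset(survey="ztf") yields "ztf", while passing a truthy testid appends it to the dataset name per the Pitt-Google naming syntax.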

property id: str

Fully qualified table ID with syntax ‘projectid.dataset.name’.
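For example, with hypothetical values, the ID is simply the three parts joined by dots:

```python
# Hypothetical values; Table.id joins projectid, dataset, and name with dots.
projectid, dataset, name = "my-project", "ztf", "alerts_v4_02"
table_id = f"{projectid}.{dataset}.{name}"
# table_id == "my-project.ztf.alerts_v4_02"
```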

insert_rows(rows: list[dict | Alert]) → list[dict][source]

Insert rows into the BigQuery table.

Parameters:

rows (list[dict or Alert]) – The rows to be inserted. Can be a list of dictionaries or a list of Alert objects.

Returns:

A list of errors encountered.

Return type:

list[dict]
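A sketch of the expected input: rows are plain dictionaries keyed by column name (Alert objects are also accepted). The column names below are hypothetical and must match your table's schema; the insert call itself is shown commented out because it requires a configured table and credentials.

```python
# Hypothetical rows; keys must match the target table's column names.
rows = [
    {"objectId": "ZTF19acfixfe", "candid": 1000000001},
    {"objectId": "ZTF19acfixfe", "candid": 1000000002},
]

# my_table: a configured pittgoogle Table pointing at an existing resource.
# errors = my_table.insert_rows(rows)
# assert not errors, f"some rows failed to insert: {errors}"
```

An empty returned list means every row was inserted successfully.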

name: str

Name of the BigQuery table.

property projectid: str

The table owner’s Google Cloud project ID.

Defaults to Table.client.client.project.

query(*, columns: list[str] | None = None, where: str | None = None, limit: int | str | None = None, to_dataframe: bool = True, dry_run: bool = False, return_sql: bool = False)[source]

Submit a BigQuery query job against this table.

This method supports basic queries against this table. For more complex queries or queries against multiple tables, use Client.query.

Parameters:
  • columns (list[str], optional) – List of columns to select. If None, all columns are selected.

  • where (str, optional) – SQL WHERE clause.

  • limit (int or str, optional) – Maximum number of rows to return.

  • to_dataframe (bool, optional) – Whether to fetch the results and return them as a pandas DataFrame (True, default) or just return the query job (False).

  • dry_run (bool, optional) – Whether to do a dry-run only to check whether the query is valid and estimate costs.

  • return_sql (bool, optional) – If True, the SQL query string will be returned. The query job will not be submitted.

Returns:

The SQL query string if return_sql is True. Otherwise, the results in a DataFrame if to_dataframe is True, else the query job.

Return type:

pandas.DataFrame, google.cloud.bigquery.QueryJob, or str

Example

import pittgoogle

alerts_tbl = pittgoogle.Table(
    name="alerts_v4_02", dataset="ztf", projectid=pittgoogle.ProjectIds().pittgoogle
)
columns = ["objectId", "candid", "candidate.jd", "candidate.fid", "candidate.magpsf"]
where = "objectId IN ('ZTF18aarunfu', 'ZTF24aavyicb', 'ZTF24aavzkuf')"

diaobjects_df = alerts_tbl.query(columns=columns, where=where)
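The SQL that Table.query assembles from columns, where, and limit can be sketched with plain f-strings. This is an assumed shape for illustration only; pass return_sql=True to see the actual string the method builds.

```python
# Assumed shape of the generated SQL; Table.query's real string may differ
# (use return_sql=True to inspect it). All values here are hypothetical.
projectid, dataset, name = "my-project", "ztf", "alerts_v4_02"
columns = ["objectId", "candid", "candidate.jd"]
where = "objectId = 'ZTF19acfixfe'"
limit = 10

select = ", ".join(columns) if columns else "*"
sql = f"SELECT {select} FROM `{projectid}.{dataset}.{name}`"
if where:
    sql += f" WHERE {where}"
if limit:
    sql += f" LIMIT {limit}"
```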

property schema: pd.DataFrame

Schema of the BigQuery table.

property table: Table

Google Cloud BigQuery Table object.

Makes a get_table request if necessary.

Returns:

The BigQuery Table object, connected to the Cloud resource.

Return type:

google.cloud.bigquery.Table