Databricks
Customers on an Enterprise or Growth plan can access Data Pipeline as an add-on package. See our pricing page for more details.
The Databricks export destination is currently in beta.
Export your Mixpanel data to Databricks using Unity Catalog Managed Volumes. This integration supports all Databricks clouds (AWS, GCP, and Azure).
Design
Mixpanel exports data to your Databricks workspace using Unity Catalog Managed Volumes. We first load the data into a single-column raw table (VARIANT type), then create a view that exposes all properties as typed columns.
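The resulting layout looks roughly like the sketch below. This is illustrative only, not the exact DDL Mixpanel runs; the JSON field paths in the view are assumptions, and the table and view names match the Data Schema section later on this page.

```sql
-- Illustrative sketch of the layout Mixpanel creates (not the exact DDL).
-- Raw table: a single VARIANT column with the full event JSON, plus a date column.
CREATE TABLE mixpanel_export.json_pipelines.mp_master_event_raw (
  DATA VARIANT,      -- complete event JSON
  event_date DATE    -- event time converted to the project's timezone
) CLUSTER BY (event_date);

-- View: exposes fields from the raw JSON as typed columns.
-- The field paths below are assumptions for illustration.
CREATE VIEW mixpanel_export.json_pipelines.mp_master_event AS
SELECT
  DATA:event::STRING        AS event_name,
  DATA:distinct_id::STRING  AS distinct_id,
  DATA:properties           AS properties,
  event_date
FROM mixpanel_export.json_pipelines.mp_master_event_raw;
```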
Supported Features
- Cross-cloud Support: Works with Databricks on AWS, GCP, or Azure
- Date Partitioning: Raw tables are clustered by `event_date` for efficient date-based queries
- Static IP Addresses: Supports IP allowlisting for secure connections
IP Allowlist
Mixpanel Data Pipelines supports static IP addresses for Databricks connections when IP restrictions are configured on your Databricks workspace. If you are using network policies to restrict access to your instance, you might need to add the following IP addresses to the allowed list:
US
34.31.112.201
35.184.21.33
35.225.176.74

EU
34.147.68.192
35.204.164.122
35.204.177.251

Prerequisites
Before setting up the integration, ensure you have:
- A Databricks workspace with Unity Catalog enabled
- Databricks Runtime 15.3+ (for VARIANT type support; see the quick check after this list)
- A SQL Warehouse (Serverless recommended for best performance and cost)
- Admin permissions in your Databricks workspace to create Service Principals
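If you are unsure whether your warehouse supports VARIANT, here is a quick sanity check: `PARSE_JSON` returns a VARIANT value, so the statement below errors on runtimes without VARIANT support.

```sql
-- Quick check: PARSE_JSON returns VARIANT, so this fails
-- on runtimes/warehouses older than 15.3.
SELECT PARSE_JSON('{"ok": true}') AS variant_check;
```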
Set Export Permissions
Step 1: Create a Service Principal (or use existing one)
A Service Principal is a Databricks identity that Mixpanel will use to access your workspace.
- In your Databricks workspace, navigate to Settings → Identity and access → Service principals
- Click Add service principal
- Click Add new
- Note the Application ID - you’ll need this later
Step 2: Generate OAuth Secret
- Click on the Service Principal you just created
- Navigate to the Secrets tab
- Click Generate secret and enter a lifetime (730 days recommended)
- Copy the Secret value immediately - it won’t be shown again
- Store it securely - you’ll need it for Mixpanel configuration
Step 3: Create Catalog and Schema (or use existing ones)
Create a dedicated catalog and schema for Mixpanel data. You can run the SQL below in the SQL Editor or in Catalog Explorer.

Note that you need the CREATE CATALOG privilege to create a new catalog:
1. Go to Catalog → settings (gear icon) → metastore
2. Click the “Grant” button in the “Permissions” tab and grant “CREATE CATALOG” to the user
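If you prefer SQL over the UI, the same metastore-level grant can be expressed as a statement like the following (a sketch; replace the example email with the user who will create the catalog):

```sql
-- Grant the metastore-level CREATE CATALOG privilege to a user.
GRANT CREATE CATALOG ON METASTORE TO `someone@example.com`;
```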
-- Create Catalog
CREATE CATALOG IF NOT EXISTS mixpanel_export;
-- Create schema
CREATE SCHEMA IF NOT EXISTS mixpanel_export.json_pipelines;

Step 4: Grant Permissions to Service Principal
Grant the Service Principal the permissions required to operate within the catalog.
GRANT USE CATALOG ON CATALOG mixpanel_export
TO `<service-principal-application-id>`;
GRANT USE SCHEMA ON SCHEMA mixpanel_export.json_pipelines
TO `<service-principal-application-id>`;
GRANT CREATE TABLE ON SCHEMA mixpanel_export.json_pipelines
TO `<service-principal-application-id>`;
GRANT CREATE VOLUME ON SCHEMA mixpanel_export.json_pipelines
TO `<service-principal-application-id>`;

For Metastore Privilege Model version 1.1+, additionally grant CREATE VIEW on the schema:
- Version 1.0: the `CREATE TABLE` privilege covers both tables and views
- Version 1.1+: a separate `CREATE VIEW` privilege is required
-- Check your metastore privilege model version (the PrivilegeModelVersion field)
DESCRIBE METASTORE;
GRANT CREATE VIEW ON SCHEMA mixpanel_export.json_pipelines
TO `<service-principal-application-id>`;

Why These Permissions?
- `USE CATALOG`: Required to access the catalog
- `USE SCHEMA`: Required to access objects in the schema
- `CREATE TABLE`: Create raw tables to store event data
- `CREATE VOLUME`: Create temporary volumes for uploading files
- `CREATE VIEW`: Create views with typed columns (metastore v1.1+ only)
Note: The Service Principal automatically becomes the owner of all objects it creates (tables, views, volumes), which grants it full permissions (SELECT, MODIFY, READ FILES, WRITE FILES, etc.) on those objects. No additional grants needed.
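To confirm the grants took effect, you can inspect them with `SHOW GRANTS` (an optional check, not required by the integration):

```sql
-- List the Service Principal's privileges on the schema.
SHOW GRANTS `<service-principal-application-id>` ON SCHEMA mixpanel_export.json_pipelines;
```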
Step 5: Grant SQL Warehouse Access
The Service Principal needs permission to use the SQL Warehouse to execute queries.
- In your Databricks workspace, navigate to SQL Warehouses
- Click on your SQL Warehouse (or create one if needed)
- Go to the Permissions tab
- Click Add or Grant permissions
- Search for your Service Principal by Application ID (from Step 1)
- Select permission level: Can use (minimum required)
- Click Save
Why this is needed: SQL Warehouses have their own access control separate from data permissions. Even with all catalog/schema/table permissions, the Service Principal cannot execute queries without warehouse access.
Step 6: Get SQL Warehouse Connection Details
- In the same SQL Warehouse, go to the Connection details tab
- Note the following values:
  - Server hostname: e.g., `abc123.cloud.databricks.com`
  - HTTP Path: e.g., `/sql/1.0/warehouses/xyz789`
Recommendation: Use a Serverless SQL Warehouse for:
- Fast startup (~3 seconds)
- Auto-scaling
- Pay-per-use pricing
- No idle cluster costs
Step 7: Configure Mixpanel Integration
Refer to Step 2: Creating the Pipeline to create the data pipeline via the UI. You’ll need to provide:
- Server Hostname (from Step 6)
- HTTP Path (from Step 6)
- Catalog (from Step 3, e.g., `mixpanel_export`)
- Schema (from Step 3, e.g., `json_pipelines`)
- Service Principal ID (Application ID from Step 1)
- Service Principal Secret (from Step 2)
Partitioning
Raw tables are clustered by the event_date column, which is computed in your project’s timezone during data load. This clustering significantly improves query performance when filtering by date.
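For illustration, the computation is roughly equivalent to the expression below. This is a sketch only; the actual conversion happens in Mixpanel’s load job, and `America/Los_Angeles` stands in for your project’s timezone.

```sql
-- Illustrative only: how a UTC event timestamp maps to a local event_date.
SELECT
  time,
  CAST(from_utc_timestamp(time, 'America/Los_Angeles') AS DATE) AS event_date_local
FROM mixpanel_export.json_pipelines.mp_master_event
LIMIT 10;
```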
Data Schema
Mixpanel creates a raw table and a view with typed columns:
Raw Table Columns (`mp_master_event_raw`):
- `DATA` (VARIANT) - Contains the complete event JSON
- `event_date` (DATE) - Computed from event time in your project’s timezone

View Columns (`mp_master_event`):
- `user_id` (STRING)
- `time` (TIMESTAMP)
- `properties` (VARIANT) - All event properties as semi-structured data
- `insert_id` (STRING)
- `event_name` (STRING)
- `distinct_id` (STRING)
- `device_id` (STRING)
- `event_date` (DATE)
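You can confirm the exact columns in your workspace with `DESCRIBE TABLE`:

```sql
-- Inspect the columns of the raw table and the typed view.
DESCRIBE TABLE mixpanel_export.json_pipelines.mp_master_event_raw;
DESCRIBE TABLE mixpanel_export.json_pipelines.mp_master_event;
```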
Queries
Remember to grant the necessary schema permissions to any user who wants to query the tables:

GRANT SELECT ON SCHEMA mixpanel_export.json_pipelines TO <user>;

Databricks supports a VARIANT type that can store semi-structured JSON data. Use the `::` syntax to extract and cast properties from VARIANT columns.
Basic event query
SELECT count(*)
FROM mixpanel_export.json_pipelines.mp_master_event
WHERE event_name = 'sign up';

Query nested properties
SELECT
event_name,
time,
distinct_id,
properties:button_name::STRING AS button_name,
properties:product:category::STRING AS product_category,
properties:price::DOUBLE AS price
FROM mixpanel_export.json_pipelines.mp_master_event
WHERE event_name = 'Purchase'
AND properties:product:category::STRING = 'Electronics';

Getting the number of events per day
SELECT
event_date,
count(*) as event_count
FROM mixpanel_export.json_pipelines.mp_master_event
WHERE event_date >= '2024-01-01' AND event_date <= '2024-01-31'
GROUP BY event_date
ORDER BY event_date;

Efficient date filtering
Use the event_date column for best performance:
SELECT
event_name,
time,
distinct_id,
properties:button_name::STRING AS button_name
FROM mixpanel_export.json_pipelines.mp_master_event
WHERE event_date BETWEEN '2024-01-01' AND '2024-01-31'
AND event_name = 'Purchase'
ORDER BY time DESC;

Costs
- Delta tables: Billed by your cloud provider (AWS S3, GCP GCS, or Azure ADLS) via Databricks
- Managed volumes: Temporary storage cleaned up after each export
- Compute: SQL Warehouse usage during COPY INTO operations
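If system tables are enabled in your workspace, a query like the following (a sketch, assuming access to `system.billing.usage`) can help track warehouse spend from these exports:

```sql
-- Sketch: daily DBU consumption for SQL Warehouse SKUs over the last 30 days.
SELECT
  usage_date,
  sku_name,
  SUM(usage_quantity) AS dbus_consumed
FROM system.billing.usage
WHERE sku_name LIKE '%SQL%'
  AND usage_date >= date_sub(current_date(), 30)
GROUP BY usage_date, sku_name
ORDER BY usage_date;
```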