technical, snowflake,

SnowPro Data Engineer Part 1 - Data Movement

After recently completing my SnowPro Core exam, I am now preparing for the SnowPro Data Engineer exam. There is less preparation material this time: no video series, just a study guide that outlines the relevant topics and mostly collects links to the Snowflake documentation. So I will try to take notes for the exam using this blog. Let’s see how it goes.

The study guide can be found here.

External Functions:

  • referred to in SQL statement
  • calls an external API
  • Setup: create an API integration, then an external function based on it (see the sketch after this list)
  • Concurrency depends on the number of users running queries with external functions, the size of those queries, the amount of compute resources in the warehouse, and the number of warehouses
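
A minimal sketch of that setup, assuming AWS API Gateway as the provider; all names, ARNs, and URLs below are placeholders:

    -- 1. Create the API integration (requires ACCOUNTADMIN or the global CREATE INTEGRATION privilege)
    CREATE API INTEGRATION my_api_integration
      API_PROVIDER = aws_api_gateway
      API_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/my_gateway_role'
      API_ALLOWED_PREFIXES = ('https://abc123.execute-api.us-east-1.amazonaws.com/prod/')
      ENABLED = TRUE;

    -- 2. Create the external function on top of the integration
    CREATE EXTERNAL FUNCTION my_remote_echo(x VARCHAR)
      RETURNS VARIANT
      API_INTEGRATION = my_api_integration
      AS 'https://abc123.execute-api.us-east-1.amazonaws.com/prod/echo';

    -- 3. Refer to it in SQL like any other function
    SELECT my_remote_echo(col1) FROM my_table;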

API integrations:

  • CREATE API INTEGRATION needs ACCOUNTADMIN or global CREATE INTEGRATION privilege
  • OWNERSHIP or USAGE needed to use it directly
  • API integration object is tied to a cloud platform account and role, but not to a specific HTTPS proxy URL
    • You can create more than one instance of HTTPS proxy service in a cloud provider
    • You can use the same API integration to authenticate to multiple proxy services
  • Snowflake account can have multiple API integration objects
  • Multiple external functions can use same API integration object

Spark integration:

  • Spark driver sends the SQL query via JDBC; Snowflake uses a warehouse to process the query and copies the result to S3; the connector retrieves the data from S3 and puts it into a Spark DataFrame
    • Snowflake handles the processing for egress, Spark worker nodes handle data ingress
  • Query pushdown can handle complex Spark logical plans
  • Data is located in Snowflake, so process most things there instead of transferring large intermediate results
  • Catalyst produces an optimized logical plan, then the connector decides whether to push the query down to Snowflake
  • Transfer options: internal, external
  • When DataFrames are used, the connector does not support SHOW, DESC, or INSERT statements

  • Apache Arrow can be used for optimized query result fetching from Snowflake

ETL vs. ELT:

  • ETL works well for relational data that must maintain its tabular structure
  • ETL for small data volumes, when source and target require different data types, and for primarily structured data
  • ELT is better for semi-structured data that must be kept in its original form until the use cases are clear
  • ELT for large data volumes, when source and target databases are of the same type, and for semi- or unstructured data (see the sketch below)
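
As a toy illustration of the ELT pattern in Snowflake: land semi-structured data unchanged in a VARIANT column, and transform it in SQL once the use case is clear. All names are placeholders:

    -- E + L: land raw JSON as-is
    CREATE TABLE raw_events (v VARIANT);

    COPY INTO raw_events
    FROM @my_stage/events/
    FILE_FORMAT = (TYPE = JSON);

    -- T: transform inside Snowflake later
    CREATE TABLE events AS
    SELECT v:id::NUMBER            AS id,
           v:payload:city::VARCHAR AS city,
           v:ts::TIMESTAMP_NTZ     AS ts
    FROM raw_events;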

Continuous data loading:

  • Use Snowpipe, Snowflake Connector for Kafka, third-party data integration tools
  • Change data tracking: use streams
    • Stream has METADATA$ACTION, METADATA$ISUPDATE, METADATA$ROW_ID columns
    • Only table owner can create initial stream on table
  • Recurring tasks: use tasks, e.g. by chaining them and reading from a stream (see the sketch after this list)
    • Streams on shared tables: enable change tracking, extend data retention period for the table
  • Tasks can be defined in tree-like dependency structure, but all tasks must have the same owner
  • After creation, tasks must be resumed
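
A minimal sketch of a stream feeding a chained pair of tasks; all object and column names are placeholders:

    CREATE STREAM orders_stream ON TABLE orders;

    -- root task: runs on a schedule, but only when the stream has data
    CREATE TASK process_orders
      WAREHOUSE = my_wh
      SCHEDULE = '5 MINUTE'
      WHEN SYSTEM$STREAM_HAS_DATA('orders_stream')
    AS
      INSERT INTO orders_processed (order_id, amount)
      SELECT order_id, amount
      FROM orders_stream
      WHERE METADATA$ACTION = 'INSERT';

    -- child task in the dependency tree: no schedule of its own
    CREATE TASK cleanup_orders
      WAREHOUSE = my_wh
      AFTER process_orders
    AS
      DELETE FROM orders_processed WHERE amount IS NULL;

    -- tasks are created suspended; resume the children before the root
    ALTER TASK cleanup_orders RESUME;
    ALTER TASK process_orders RESUME;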

Snowpipe:

  • loads from named internal or external stages and from table stages (user stages are not supported); see the sketch below
  • Auto ingest only for external stages
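
A minimal sketch of a pipe with auto-ingest; all names are placeholders:

    CREATE PIPE my_pipe
      AUTO_INGEST = TRUE  -- only available for external stages
    AS
      COPY INTO my_table
      FROM @my_ext_stage/data/
      FILE_FORMAT = (TYPE = CSV);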

Connectors and drivers:

  • Snowflake connector for Python
  • Snowflake connector for Spark
  • Snowflake connector for Kafka
  • Node.js driver
  • Go driver
  • .NET driver
  • JDBC driver
  • ODBC driver
  • PHP PDO driver
  • Snowflake SQL API

Python connector:

  • bypass data conversion to native Python types by using SnowflakeNoConverterToPython in snowflake.connector.converter_null module
  • Bind data with the connector’s parameter binding instead of composing SQL strings with Python string formatting, which risks SQL injection

Kafka connector:

  • The connector creates one internal stage per topic to temporarily store data files, one pipe per topic partition to ingest the files, and one table per topic
  • Instances of the Kafka connector do not communicate with each other
  • Automatically creates target tables with RECORD_CONTENT and RECORD_METADATA columns (see the query sketch below)
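
A sketch of querying a table populated by the connector; the table name and the customer_id key inside RECORD_CONTENT are placeholders:

    -- both columns are VARIANT; RECORD_METADATA holds topic, partition, offset, etc.
    SELECT RECORD_METADATA:topic::VARCHAR      AS topic,
           RECORD_METADATA:partition::NUMBER   AS kafka_partition,
           RECORD_CONTENT:customer_id::NUMBER  AS customer_id
    FROM my_kafka_table;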

Data sharing:

  • Secure Direct Data Sharing: create a share, add objects by granting privileges, add accounts to the share (see the sketch below)
  • Snowflake Marketplace: all non-VPS accounts, become provider or consumer, IMPORT SHARE privilege necessary to get or request data
  • Data Exchange: own data hub for collaborating around data within a selected group of members
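
A minimal sketch of the provider side of Secure Direct Data Sharing; names and the consumer account identifier are placeholders:

    CREATE SHARE my_share;

    -- add objects by granting privileges to the share
    GRANT USAGE ON DATABASE my_db TO SHARE my_share;
    GRANT USAGE ON SCHEMA my_db.public TO SHARE my_share;
    GRANT SELECT ON TABLE my_db.public.orders TO SHARE my_share;

    -- add consumer accounts to the share
    ALTER SHARE my_share ADD ACCOUNTS = myorg.consumer_account;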

COPY INTO:

  • loads from a named internal stage, a table or user stage (load files there using PUT first), or a named external stage
  • VALIDATION_MODE: RETURN_<n>_ROWS, RETURN_ERRORS, RETURN_ALL_ERRORS
  • FORMAT_NAME and TYPE are mutually exclusive
  • Compression: gzip, bz2, brotli, zstd, deflate, raw_deflate
  • Allowed operations: column reordering, omission, casts, truncating text strings that exceed the target column length (see the sketch below)
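
A sketch of a validation dry run followed by a load with reordering and casts; names are placeholders:

    -- dry run: report errors without loading anything
    COPY INTO my_table
    FROM @my_stage/data/
    FILE_FORMAT = (FORMAT_NAME = 'my_csv_format')
    VALIDATION_MODE = RETURN_ERRORS;

    -- load with column reordering and casts via a transformation query
    COPY INTO my_table (id, name, created_at)
    FROM (SELECT $2::NUMBER, $1, $3::TIMESTAMP_NTZ FROM @my_stage/data/)
    FILE_FORMAT = (FORMAT_NAME = 'my_csv_format');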

Data movement:

  • per-byte fee when transferring data from a Snowflake account into cloud storage in another region or on another cloud platform
  • Egress charges: for unloading from Snowflake, database replication, external functions

Misc commands:

  • DESCRIBE STAGE: describes values specified for properties in a stage (file format, copy, location) as well as defaults
    • SELECT METADATA$FILENAME, METADATA$FILE_ROW_NUMBER FROM @stage
  • CREATE FILE FORMAT: file format describes set of staged data to be loaded into Snowflake
  • VALIDATE_PIPE_LOAD: validate files processed by Snowpipe within time range (up to last 14 days)
  • COPY_HISTORY table function: Snowflake data loading history, returns files loaded using COPY INTO and Snowpipe
    • returns more than the 10,000 rows of the LOAD_HISTORY view
  • CREATE STREAM: new stream recording changes for a source object
  • CREATE TASK: new task, can be cloned, schedule is optional
  • CREATE EXTERNAL TABLE (see the sketch after this list):
    • no clustering keys, cloning, xml data, time travel, masking policy
    • METADATA$FILENAME, METADATA$FILE_ROW_NUMBER for file information
    • row access policy can be added
    • use POLICY_CONTEXT for simulating query results
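
A minimal sketch of an external table over Parquet files; names are placeholders:

    CREATE EXTERNAL TABLE my_ext_table (
      id NUMBER AS (VALUE:id::NUMBER),         -- column derived from the file contents
      src_file VARCHAR AS (METADATA$FILENAME)  -- file information via metadata column
    )
    LOCATION = @my_ext_stage/data/
    FILE_FORMAT = (TYPE = PARQUET)
    AUTO_REFRESH = TRUE;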

Misc queries:

  • SYSTEM$PIPE_STATUS for checking the execution state of a pipe (the loaded files themselves are validated with VALIDATE_PIPE_LOAD, above)
  • Explain queries: SELECT * FROM TABLE(EXPLAIN_JSON(SYSTEM$EXPLAIN_PLAN_JSON('query')))

Misc tables:

  • WAREHOUSE_METERING_HISTORY for total credit consumption over a time period