strava-snowflake

module
v0.0.0-...-a24ee6e Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Nov 8, 2023 License: MIT

README

strava-snowflake

track and export personal Strava data. Ingest and visualize in Snowflake.

copy github.com/stevenpelley/strava3golang/oauth_client_config.json and the resulting token.json here. Also needs snowflake key pair credentials placed in snowflake_config (see template)

Current State

passing tests for both duckdb and snowflake downloads activities within time range from strava skips retrieving streams for activities already in database or listed in command line options stores and transforms data in duckdb. python snowpark udfs organized, flatten_streams written and tested. Additionally fills time gaps. vectorized ewma done. Surprisingly slow. non-vectorized ewma done. Much faster. Also simpler. And correlates rows through the lateral join. Looked into managing stages and UDF artifacts. Temporary (anonymous) UDFs created by snowpark do clean up the associated files, but only once the session is cleaned. If the session is not explicitly closed this takes 24 hours. Close your sessions, people (me). Named UDFs do not automatically have their files cleaned up. This is problematic because I "create or replace" via snowpark, leaking the stage files. Conclusion: always know where UDF stage files go (e.g., put all of them in a single stage and don't put other files there) and then outer join files in the stage (LIST) against UDF artifact locations ("import" field of information_schema.functions). Worksheet snippets below. Created a simple pythong script in src/main.py to register the UDFs as permanent UDFs.

TODO

  • try out dbt for ETL. This might drastically simplify materializing streams.
  • dashboarding
    • simple dashboards should provide list of activities with summary stats, the ability to drill into each activity and provide more data, as well as a time-window-based training progress chart.
    • try out snowflake builtin dashboarding
    • try out some other dashboard options (superset, mesosphere)

Random information

Cleaning up UDF stage files worksheet:
list @udf_stage;
-- use the query id above to query those files in a select statement for joining
select * from table(result_scan('01b031a7-0604-d114-0001-305300036222'));


with funcs as (
    select
        function_catalog,
        function_schema,
        function_name,
        function_owner,
        argument_signature,
        data_type,
        function_language,
        created,
        last_altered,
        imp.value::string as import_value,
        substr(import_value, position('/', import_value)) as file_name,
        handler,
        packages
    from information_schema.functions, table(flatten(parse_json(imports))) as imp
    where 1=1
        and function_catalog = 'STRAVA'
        and function_language = 'PYTHON'
        and function_name not ilike 'snowpark_temp%'
), stage_files as (
  select
      *,
      substr("name", position('/', "name")) as file_name
  from table(result_scan('01b031a7-0604-d114-0001-305300036222'))
), joined as (
    select
        function_catalog,
        function_schema,
        function_name,
        import_value,
        funcs.file_name as funcs_file_name,
        stage_files."name" as stage_files_name,
        stage_files.file_name as stage_files_file_name
    from funcs full outer join stage_files on funcs.file_name = stage_files.file_name
)
select *
from joined
where function_name is null
;


remove @udf_stage/bf8643c74b62f4ba5d095beb4108a2a7c2f62191dc92106e13119af3594010f7/src.zip;

Directories

Path Synopsis
cmd
internal
intsql/intduckdb
functions to create and verify our schema.
functions to create and verify our schema.

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL