Data Pipeline v2: Polywrap Success Metrics
Overview
In a previous engagement we developed a Polywrap data lake, which caches data from GitHub, npm, Twitter, Discord, Discourse, and more. (link to repo)
That engagement closed with the following plan:
> In the future, we may wish to set up a data transformation pipeline that queries raw data from the data lake and transforms it into a state that facilitates analysis.
I would like to propose using the data lake to create an SDK and a better dev environment that can help us track the status of key metrics, such as Polywrap's user base and reported bugs, plus utilities such as opt-in user profiles where we could see a detailed view of each member's contributions to the ecosystem.
How can this be created?
Within the same data pipeline repo we can iterate on the prototype to create the SDK. The SDK can be published to PyPI, installed locally with pip, and also used in a Google Colab environment, as sketched below.
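As a rough sketch of the developer experience we are aiming for (the package name `polywrap-data-sdk` and the exact call signatures below are assumptions, not a published API):

```python
# Hypothetical usage sketch -- the package name and call signatures
# are assumptions about the future SDK, not a published API.
# pip install polywrap-data-sdk

from polywrap_data_sdk import users, npm

# "How many users does Polywrap have?" straight from the data lake.
print(users.total())

# Month-over-month npm downloads, ready to plot in a local notebook or Colab.
downloads = npm.downloads()
print(downloads.head())
```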
Further down the road, once the desired metrics are accessible, we could create a new subdomain to display the data (e.g. data.polywrap.io/profiles).
Roadmap
1. Set out the data to collect:
Begin with preliminary internal research, interviewing Polywrap team members, and define improvement areas we could add to the data pipeline experience:
Structure:
- Store the data as tables: this eases maintenance and avoids duplication. S3 is unstructured object storage, while our data is natively structured (see the sketch below).
- Build an SDK for easier interaction
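For example, the raw S3 objects could be registered as a structured, queryable Athena table. A sketch; the bucket path and columns are guesses at what the GitHub data might contain:

```python
# Sketch: register raw S3 JSON as a structured Athena table.
# The bucket path and column names are illustrative assumptions.
CREATE_GITHUB_EVENTS = """
CREATE EXTERNAL TABLE IF NOT EXISTS github_events (
    repo       string,
    actor      string,
    event_type string,
    created_at timestamp
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://polywrap-data-lake/github/events/'
"""
```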
Objectives:
- Create interfaces for the data we can track
- Track two types of metrics: development oriented and community oriented.
Development Oriented
Category | Question | Answer |
---|---|---|
Users | How many users does Polywrap have? | `users.total` SDK endpoint |
Users | How many "internal contributors" does Polywrap have? | `users.internal_total` SDK endpoint |
Users | How many "community contributors" does Polywrap have? | `users.community_total` SDK endpoint |
Users | What is the contributor data structure? | `users.Contributors` data structure in the SDK returns a sample user table |
Users | What contributions are we measuring per user? | `users.Contributors.Contributions` data structure in the SDK |
Users | Who are the N top contributors, month over month? | `users.top_contributors(N)` SDK endpoint |
Users | Who are the N newest contributors? | `users.newest(N)` SDK endpoint |
npm | How many npm downloads, month over month? | `npm.downloads` SDK endpoint |
npm | How many npm releases, month over month? | `npm.releases` SDK endpoint |
github | How quickly are PRs closed in the repos we are tracking/whitelisting? | `github.PR_speed` SDK endpoint |
github | How many times have the whitelisted repos been cloned over time? | `github.clones` SDK endpoint |
github | How many bugs have been reported by external contributors? | `github.external_bugs` SDK endpoint, for bugs that have been reported and labeled in GitHub |
github | How many interactions (clones, forks, stars, etc.) do whitelisted repos have over time? | `github.interactions` SDK endpoint |
github | How many times has `client.invoke()` been called? | `github.invokes` SDK endpoint |
github | How many times has `client.subinvoke()` been called? | `github.subinvokes` SDK endpoint |
github | What is the status of the WASM test harness over time? | `github.wasm_test_harness` SDK endpoint |
github | What is the status of the client test harness over time? | `github.clients_test_harness` SDK endpoint |
wrappers.io | How many wrappers are being published, month over month? | `wrappersio.published` SDK endpoint |
wrappers.io | How many wrappers of each WASM language are being published, month over month? | `wrappersio.languages` SDK endpoint |
wrappers.io | How many wrappers of each version are being published, month over month? | `wrappersio.versions` SDK endpoint |
wrappers.io | What are the sizes of the wrappers that are being published? | `wrappersio.wrapper_sizes` SDK endpoint |
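To make the `users` endpoints above concrete, their surface might look roughly like this (a minimal stub; the signatures and pandas return types are assumptions):

```python
# Minimal interface stub for the `users` module -- signatures and
# return types are assumptions derived from the table above.
import pandas as pd

def total() -> int:
    """Total number of Polywrap users across all data sources."""
    ...

def internal_total() -> int:
    """Number of "internal contributors"."""
    ...

def community_total() -> int:
    """Number of "community contributors"."""
    ...

def top_contributors(n: int) -> pd.DataFrame:
    """Top N contributors, month over month; one row per (month, user)."""
    ...

def newest(n: int) -> pd.DataFrame:
    """The N most recently seen contributors."""
    ...
```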
Community Oriented
Category | Question | Answer |
---|---|---|
Discord | How does activity in the various language channels compare over time? | `discord.languages` |
Discord | How does activity in the dev channels compare over time? | `discord.development` |
Twitter | How have followers grown over time? | `twitter.Followers` |
Twitter | How have mentions grown over time? | `twitter.Mentions` |
Twitter | What does aggregated interaction growth look like over time? | `twitter.interactions` |
Twitter | What does reach growth look like over time? | `twitter.reach`, measuring impressions |
Analytics | What is the number of landing page views over time? | `analytics.pageviews` |
Analytics | What are the top N most viewed pages on polywrap.io? | `analytics.top_pages(N)` |
Analytics | Which countries visit polywrap.io the most? | `analytics.countries(N)` |
2. Develop the Python SDK
Today, accessing the data pipeline is cumbersome: it requires Amazon IAM signups and a SageMaker service, instead of a simple local Jupyter Notebook or a Google Colab that is free, faster to start up, and easier to share across the team. The data initiative leader currently has to prepare these reports manually and share them with the audience.
To reduce friction, it has been suggested to:
- Create an SDK for more easily accessing the data pipeline, which can:
  - Download data from S3 locally for experimentation.
  - Query data from the Polywrap Ethereum wrappers and the Python client.
- Set up a Google Colab shared folder that uses the SDK and allows online analysis with less friction than Amazon SageMaker.
- Add a CI/CD implementation for workflows that are tedious and need automation, if they haven't been developed already:
  - Re-run the analysis regularly and publish the results somewhere (the last Jupyter notebooks were updated 9 months ago). One alternative is scheduled Google Colab runs; 3 cron jobs per month cost about 10 cents per month (see the sketch after this list).
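One way to automate those re-runs, assuming the reports live in Jupyter notebooks: a small script, triggered by an external scheduler (cron, a CI schedule, or a scheduled Colab run), that re-executes each notebook with papermill. The notebook paths below are placeholders.

```python
# Sketch of a scheduled notebook refresh, meant to be triggered by an
# external scheduler (cron / CI). Notebook paths are placeholders.
import papermill as pm

NOTEBOOKS = [
    "analysis/users_overview.ipynb",
    "analysis/npm_downloads.ipynb",
]

for nb in NOTEBOOKS:
    # Re-execute each notebook in place so the published copy stays fresh.
    pm.execute_notebook(nb, nb)
```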
Working on Deliverables
The SDK could have multiple modules:
- `Enricher` module: collects the data and adds it to `S3`, so that as little of the implementation as possible keeps living inside the `AWS Lambda` service.
- `Fetching` module: queries AWS Athena via the `boto3` module to retrieve the data from `S3` (sketched after this list).
- `Analytics` module: processes the data to prepare it for analysis. This would be part of the SDK and later be leveraged by Jupyter notebook functions; it would process the data and create new metrics from the tables listed at the beginning of this proposal.
- Colab: we could also use the SDK in notebooks such as Google Colab, and in either a front-end dashboard written in Python or a Python back end that supports a JS front end.
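As an illustration of the `Fetching` module, the Athena round trip through `boto3` could look roughly like this (the database, table, and results bucket names are placeholder assumptions):

```python
# Sketch of the Fetching module's Athena round trip via boto3.
# Database, table, and output bucket names are placeholder assumptions.
import time
import boto3

athena = boto3.client("athena")

def run_query(sql: str, database: str = "polywrap_data_lake",
              output: str = "s3://polywrap-athena-results/") -> list:
    """Run a SQL query on Athena and return the raw result rows."""
    execution = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output},
    )
    query_id = execution["QueryExecutionId"]

    # Poll until Athena finishes executing the query.
    while True:
        state = athena.get_query_execution(QueryExecutionId=query_id)[
            "QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(1)

    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena query ended in state {state}")
    return athena.get_query_results(
        QueryExecutionId=query_id)["ResultSet"]["Rows"]

# Example: monthly npm downloads from a (hypothetical) downloads table.
rows = run_query("SELECT month, downloads FROM npm_downloads ORDER BY month")
```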
Polywrap Data SDK
Milestone | Description | Cost |
---|---|---|
Enrichment | Refactor the existing S3 data loading and add it to the SDK | |
Fetching | Use the Athena service to query S3 buckets through the SDK | |
Fetching v2 | Use the Polywrap Python client to query Ethereum instead of the current 3rd-party module | |
Analysis | Compile the various analytics functions that exist today | |
Analysis v2 | Add new functions to make more in-depth metrics available | |
pip deployment | Publish the package to PyPI (similar to npm) | |
Subtotal | | $2750 |
Google Colab Polywrap Setup
Milestone | Description | Cost |
---|---|---|
Colab setup | Set up a comfy Google Colab environment for data analytics that leverages the SDK | |
Subtotal | | $350 |
CI/CD & Tests
Milestone | Description | Cost |
---|---|---|
CI/CD | Set up CI/CD for the SDK | |
Tests | Set up tests for the SDK | |
Code styling | Set up basic lint and styling scripts to verify code quality | |
Update data | Auto-run the lambdas every X days to keep the available data updated | |
Subtotal | | $1100 |
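For the tests milestone, a first suite could pin down the SDK's public surface; a sketch assuming pytest and the hypothetical `users` module from earlier (the error contract is also an assumption):

```python
# Sketch of first SDK tests -- module names and the ValueError contract
# are assumptions, not existing behavior.
import pytest

from polywrap_data_sdk import users

def test_total_is_a_non_negative_count():
    assert users.total() >= 0

def test_top_contributors_rejects_non_positive_n():
    with pytest.raises(ValueError):
        users.top_contributors(0)
```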
Documentation
Milestone | Description | Cost |
---|---|---|
Demo video | Overview video tutorial covering all functionality of the SDK and how to make inferences and graphics | |
Subtotal | | $250 |
TOTAL Funds Requested
4450 USDC and TBD WRAP
Velocity & Estimated Timeline
The numbers below are rough estimates and will change throughout the course of the project depending on the phase.
Target Weekly Velocity | Estimated Start | Estimated Duration |
---|---|---|
~10-20 hours/week | 1-Jan-23 | ~3 months |
Sponsors
- Kris
- Kevin
Terms
By submitting this proposal, I understand that the DAO and my sponsors will be evaluating whether my work meets the acceptance criteria. If it does not, the DAO will determine what percentage of the proposal cost to pay, if any.
I also understand that I may not begin work until it is confirmed that the Snapshot proposal has passed.
[ X ] I agree
Appendix:
Example Data structure for Committed Contributors
Getting to a data structure like the one below requires some data wrangling, since user data is not yet linked this way anywhere; but having such an arrangement can be leveraged heavily in both the immediate and long term:
```
{
  id: 321,
  eth: {
    address: '0xasdb2138901238',
    WRAP: 23,
    PolywrapNFTs: 2,
    snapshot: {
      proposals_submitted: 3,
      proposals_voted: 20,
    },
  },
  github: {
    username: 'rihp',
    commits: 120,
    pull_requests: {
      open: 1,
      closed: 2,
      merged: 10,
    },
    issues: {
      assigned: {
        open: 0,
        closed: 20,
      },
    },
  },
  discord: {
    username: 'Media#1337',
  },
  discourse: {
    username: 'DaoAdvocate',
    comments: 230,
    threads: 2,
  },
  twitter: {
    username: 'DaoAdvocate',
    verified: false,
    follows_polywrap: true,
    polywrap_mentions: 2,
    followers: 200,
  },
}
```
This could also help us track users who are interested in Polywrap but are not developers (for example, users who mention Polywrap on Twitter).
This is a mocked record of generated user data, which we can use to prepare the pipeline's next step (see the sketch below).
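To turn nested records like the one above into that dataframe, pandas' `json_normalize` does most of the wrangling; a minimal sketch using a trimmed copy of the mock record:

```python
# Flatten a nested user record into one dataframe row.
import pandas as pd

# Trimmed copy of the mock record from the appendix.
record = {
    "id": 321,
    "github": {"username": "rihp", "pull_requests": {"merged": 10}},
    "discourse": {"username": "DaoAdvocate", "comments": 230},
    "twitter": {"followers": 200, "follows_polywrap": True},
}

# Nested keys become dotted columns, e.g. "github.pull_requests.merged".
df = pd.json_normalize(record)
print(df.columns.tolist())
print(df.iloc[0])
```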