2023-01 Data Pipeline Initiative - Part 2

Data Pipeline v2: Polywrap Success Metrics

Overview

From a previous engagement we have developed a Polywrap data lake, which caches data from GitHub, npm, Twitter, Discord, Discourse, and more. (link to repo)

From that engagement, the following was planned:

In the future, we may wish to set up a data transformation pipeline that queries raw data from the data lake and transforms it into a state that facilitates analysis.

I would like to propose using the data lake to create an SDK and a better dev environment that can help us answer questions about key metrics such as Polywrap's user base and bugs reported, along with utilities like user profiles for willing members, where we would be able to see a detailed view of their contributions to the ecosystem.

How can this be created?

Within the same data pipeline repo we can iterate on the prototype to create the SDK. It can be published to pip, and we can use it locally. Additionally, we can use it in a Google Colab environment.
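
To make the developer experience concrete, here is a minimal usage sketch. The package name and methods are placeholders drawn from the metric tables below, not an existing API:

    # Hypothetical usage sketch; the package name and methods are placeholders.
    # pip install polywrap-data-sdk
    from polywrap_data_sdk import PolywrapData

    client = PolywrapData()        # would pick up AWS credentials from the environment
    print(client.users.total())    # e.g. the users.total endpoint listed in the tables below
    print(client.npm.downloads())  # month-over-month npm downloads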

Further down the road, when we have the desired metrics accessible, we could create a new subdomain to display the data (e.g. data.polywrap.io/profiles).

Roadmap

1. Set out data to collect:

Begin with some preliminary internal research, interviewing Polywrap team members, and identify improvement areas we could potentially add to the data pipeline experience:

Structure:

  • Storing the data as tables helps with maintenance and avoids duplication. S3 is an unstructured store, while our data is natively structured.
  • Build an SDK for easier interaction

Objectives:

  • Creating interfaces for what data we can track
  • Tracking two types of metrics: Development oriented and Community oriented.

Development Oriented

Category | Question | Answer
--- | --- | ---
Users | How many users does Polywrap have? | users.total SDK endpoint
Users | How many "internal contributors" does Polywrap have? | users.internal_total SDK endpoint
Users | How many "community contributors" does Polywrap have? | users.community_total SDK endpoint
Users | What is the contributor data structure? | users.Contributors data structure in the SDK returns a sample user table
Users | What contributions are we measuring per user? | users.Contributors.Contributions data structure in the SDK
Users | Who are the N top contributors, month over month? | users.top_contributors(N) SDK endpoint
Users | Who are the N newest contributors? | users.newest(N) SDK endpoint
npm | How many npm downloads, month over month? | npm.downloads SDK endpoint
npm | How many npm releases, month over month? | npm.releases SDK endpoint
github | What is the speed of PRs getting closed in the repos we are tracking/whitelisting? | github.PR_speed SDK endpoint
github | How many times have the whitelisted repos been cloned over time? | github.clones SDK endpoint
github | How many bugs have been reported by external contributors? | github.external_bugs SDK endpoint for bugs that have been reported and labeled in GitHub
github | How many interactions (clones, forks, stars, etc.) do whitelisted repos have over time? | github.interactions SDK endpoint
github | How many times has client.invoke() been called? | github.invokes SDK endpoint
github | How many times has client.subinvoke() been called? | github.subinvokes SDK endpoint
github | What is the status of the WASM test harness over time? | github.wasm_test_harness SDK endpoint
github | What is the status of the client test harness over time? | github.clients_test_harness SDK endpoint
wrappers.io | How many wrappers are being published, month over month? | wrappersio.published SDK endpoint
wrappers.io | How many wrappers of each WASM language are being published, month over month? | wrappersio.languages SDK endpoint
wrappers.io | How many wrappers of each version are being published, month over month? | wrappersio.versions SDK endpoint
wrappers.io | What are the sizes of the wrappers that are being published? | wrappersio.wrapper_sizes SDK endpoint

Community Oriented

Category | Question | Answer
--- | --- | ---
Discord | How does activity in the various language channels compare over time? | discord.languages
Discord | How does activity in the dev channels compare over time? | discord.development
Twitter | What does follower growth over time look like? | twitter.Followers
Twitter | What does mention growth over time look like? | twitter.Mentions
Twitter | What does aggregated interaction growth over time look like? | twitter.interactions
Twitter | What does reach growth over time look like? | twitter.reach (measuring impressions)
Analytics | What is the number of landing page views over time? | analytics.pageviews
Analytics | What are the top N most viewed pages on polywrap.io? | analytics.top_pages(N)
Analytics | Which countries visiting polywrap.io are most common? | analytics.countries(N)

2. Develop the Python SDK

Today, the process to access the data pipeline is cumbersome: it involves Amazon IAM signups and a SageMaker service instead of a simple local Jupyter Notebook or a Google Colab that is free, quicker to start up, and easier to share among the team. The data initiative leader has to manually prepare these reports and share them with the audience.

To reduce friction, it has been suggested to:

  • Create an SDK for more easily accessing the data pipeline that can:
    • Download data from S3 locally and experiment with it (see the sketch after this list).
    • Query data from Polywrap Ethereum wrappers and the Python client.
    • Set up a Google Colab shared folder that uses the SDK and allows online analysis with less friction than Amazon SageMaker.
  • Add a CI/CD implementation for workflows that are tedious and need automation, if they haven't been developed already:
    • Re-run the graphs regularly and publish them somewhere (the last Jupyter notebooks were updated 9 months ago). One alternative is scheduled Google Colab runs; 3 cron jobs per month cost about 10 cents per month.
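
As an illustration of the first point above, here is a minimal sketch of downloading a table from S3 for local experimentation; the bucket name and object key are placeholders:

    # Sketch only: the bucket name and object key below are placeholders.
    import boto3
    import pandas as pd

    s3 = boto3.client("s3")
    s3.download_file(
        Bucket="polywrap-data-lake",           # placeholder bucket name
        Key="github/issues/part-000.parquet",  # placeholder object key
        Filename="issues.parquet",
    )

    # Experiment locally with the downloaded table (requires pyarrow or fastparquet).
    df = pd.read_parquet("issues.parquet")
    print(df.head())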

Working on Deliverables

  • The SDK could have multiple modules:
    • Enricher module: for collecting the data and adding it to S3, so the implementations stop living inside the AWS Lambda service as much as possible.
    • Fetching module: for querying AWS Athena to retrieve the data from S3 using the boto3 module (see the sketch after this list).
    • Analytics module: for processing the data to prepare it for analysis. This would be part of the developed SDK and later be leveraged by Jupyter notebook functions. It would process the data and create new metrics from the tables mentioned at the beginning of this proposal.
    • Colab: we could also use the SDK in notebooks, like Google Colab, and in either a front-end dashboard written in Python or a Python back end that supports a JS front end.
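
To make the Fetching module idea concrete, here is a minimal sketch of querying Athena through boto3; the database name, table name, and output location are assumptions, and a real module would add pagination and error handling:

    # Sketch of the Fetching module idea; database, table, and output location are placeholders.
    import time
    import boto3

    athena = boto3.client("athena")

    def run_query(sql: str) -> list:
        execution = athena.start_query_execution(
            QueryString=sql,
            QueryExecutionContext={"Database": "polywrap_data_lake"},                # placeholder
            ResultConfiguration={"OutputLocation": "s3://polywrap-athena-results/"}, # placeholder
        )
        query_id = execution["QueryExecutionId"]
        # Poll until Athena finishes the query (a real module would add timeouts/backoff).
        while True:
            status = athena.get_query_execution(QueryExecutionId=query_id)
            state = status["QueryExecution"]["Status"]["State"]
            if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
                break
            time.sleep(1)
        return athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]

    rows = run_query("SELECT repo, COUNT(*) AS clones FROM github_clones GROUP BY repo")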

Polywrap Data SDK

Milestone | Description | Cost
--- | --- | ---
Enrichment | Refactoring the existing S3 data loading and adding it to the SDK |
Fetching | Using the Athena service to query S3 buckets through the SDK |
Fetching v2 | Using the Polywrap Python client to query Ethereum instead of the current 3rd-party module |
Analysis | Compiling the various analytics functions that exist today |
Analysis v2 | Adding new functions to make more in-depth metrics available |
pip deployment | Publishing the package (similar to npm) |
Subtotal | | $2,750

Google Colab Polywrap Setup

Milestone | Description | Cost
--- | --- | ---
Colab setup | Set up a comfy Google Colab environment for data analytics that leverages the SDK |
Subtotal | | $350

CI/CD & Tests

Milestone | Description | Cost
--- | --- | ---
Code styling | Set up CI/CD for the SDK |
Code styling | Set up tests for the SDK |
Code styling | Set up basic lint and styling scripts to verify code quality |
Update data | Auto-run the lambdas every X days to keep the available data updated |
Subtotal | | $1,100

Documentation

Milestone | Description | Cost
--- | --- | ---
Demo video | Overview video tutorial covering all functionality of the SDK and how to make inferences and graphics |
Subtotal | | $250

TOTAL Funds Requested

4450 USDC and TBD WRAP

Velocity & Estimated Timeline

The numbers below are rough estimates and will change throughout the course of the project depending on the phase.

Target Weekly Velocity | Estimated Start | Estimated Duration
--- | --- | ---
~10-20 hours/week | 1-Jan-23 | ~3 months

Sponsors

  • Kris
  • Kevin

Terms

By submitting this proposal, I understand that the DAO and my sponsors will be evaluating whether my work meets the acceptance criteria. If it does not, the DAO will determine what percentage of the proposal cost to pay, if any.

I also understand that I may not begin work until it is confirmed that the Snapshot proposal has passed.

[ X ] I agree


Appendix:

Example Data structure for Committed Contributors

To get to a data structure similar to this, some data wrangling is necessary, as user data is not yet linked like this anywhere, but such an arrangement could be leveraged heavily in both the immediate and long term:

{
    id: 321,
    eth: {
        address: '0xasdb2138901238',
        WRAP: 23,
        PolywrapNFTs: 2,
        snapshot: {
            proposals_submitted: 3,
            proposals_voted: 20,
        },
    },
    github: {
        username: 'rihp',
        commits: ,
        pull_requests: {
            open: 1,
            closed: 2,
            merged: 10,
        },
        issues: {
            assigned: {
                open: 0,
                closed: 20,
            },
        },
    },
    discord: {
        username: 'Media#1337',
    },
    discourse: {
        username: 'DaoAdvocate',
        comments: 230,
        threads: 2,
    },
    twitter: {
        username: 'DaoAdvocate',
        verified: false,
        follows_polywrap: true,
        polywrap_mentions: 2,
        followers: 200,
    },
}

This could also help track users who have an interest in Polywrap but are not developers (for example, if they mention Polywrap on Twitter).

This is a mocked dataframe of generated user data, which we can use to prepare the pipeline’s next step.
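
For example, assuming a list of such mocked records (all values below are invented sample data), flattening them into a dataframe with pandas could look like this:

    # Sketch: flattening mocked contributor records into a dataframe.
    # All values are invented sample data.
    import pandas as pd

    users = [
        {
            "id": 321,
            "eth": {"address": "0xasdb2138901238", "WRAP": 23, "PolywrapNFTs": 2},
            "github": {"username": "rihp", "pull_requests": {"open": 1, "closed": 2, "merged": 10}},
            "twitter": {"followers": 200, "polywrap_mentions": 2},
        },
    ]

    # json_normalize turns the nested dicts into dotted columns,
    # e.g. "github.pull_requests.merged".
    df = pd.json_normalize(users)
    print(df.columns.tolist())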

I think $4,450 seems like a small budget. I think if you estimate the hours and add them up, you’ll come up with a higher budget estimate.

You estimated a timeline of 10-20 hours per week over three months. At the high end that is 20 * 13 = 260 hours, or $17/hr. At the low end of the time estimate, you’re asking for $34/hr.

I understand you’re not highly experienced and it makes sense that your requested pay rate is lower. I just want to make sure everyone is happy with this.

Maybe we should either shrink the scope of the project or increase the budget? I don’t know what kind of budget we want to allocate for this, or what a fair pay rate is given your skill level.

@keeevin could you share your thoughts?

Should we also include project management in the budget? Maybe Emma or one of the engineers could provide some project management support?


@DaoAdvocate can we first try shrinking the scope to see where we land in terms of $/hr?

Here are some suggestions:

  • Removing Analysis v2 (Adding the new functions to make more in-depth metrics available)
  • Removing Fetching v2 (Using the polywrap python client to query Ethereum instead of the current 3rd party module)
  • Instead of a demo video, let’s just have a simple README first

I recommend we stay within the $4.4k total budget, but shrink the scope to increase $/hr.

cc @orishim for visibility


After some brainstorming with @kris, we could remove those sections of the proposal and implement a simpler, SDK-oriented approach that uses existing analytics endpoints, without implementing the Python Polywrap client, for example.

The one area where we have yet to decide whether it is needed now, or should be subject to a later proposal, is the Colab/Jupyter notebook addition.

This would help python newbies get started on a web-based platform to test out the visualizations and edit them.

If the decision were to remove the notebooks too, we could definitely reduce the hours needed by 10 to 20 hours.

To add more context, I had a quick sync call with Emma, and we agreed that having at least a Jupyter notebook accessible would be very beneficial for the ops team.

I think it's still possible to build this within this engagement.

Hey! I’m just seeing this.

  • Storing the data as tables helps with maintenance and avoids duplication. S3 is an unstructured store, while our data is natively structured.

This is not correct. The data is stored as .parquet, has a schema, and is structured and partitioned using the Athena SQL engine. So you could easily avoid that situation by using SQL or adding some code in the lambda, if that ever happens.

Build an SDK for easier interaction

The data is easily accessible through Athena. You can consume it mostly anywhere using Boto3.

By “endpoints,” I understand that you are talking about methods of this SDK. Most of this is already built on the lambdas.

To reduce friction, it has been suggested to:

  • Create an SDK for more easily accessing the data pipeline that can:
    • Download data from S3 locally and experiment with it.
    • Query data from Polywrap Ethereum wrappers and the Python client.
    • Set up a Google Colab shared folder that uses the SDK and allows online analysis with less friction than Amazon SageMaker.
  • Add a CI/CD implementation for workflows that are tedious and need automation, if they haven't been developed already:
    • Re-run the graphs regularly and publish them somewhere (the last Jupyter notebooks were updated 9 months ago). [One alternative](https://cloud.google.com/scheduler) is scheduled Google Colab runs; 3 cron jobs per month cost about 10 cents per month.
  • The data, as I said, is easily accessible for querying and downloading. Try to use Boto3 and Athena if you haven't heard of them :wink:
  • The idea that a Google Colab has less "friction" than AWS Studio is not true. They are pretty similar services. In fact, AWS offers more flexibility and security than Google.
  • I don't fully understand the third point. Publishing a graph and running a Jupyter notebook are two different things. You can set up a Jupyter notebook to be public and updated with a job inside Studio in two clicks.

The next step on data should go into creating a dashboard, Jupyter notebooks (playing with the data), or extending those lambdas to collect something additional if necessary, and displaying those materials somewhere. The idea of creating an SDK could be valid if its purpose were to aggregate all the clients into one SDK, so it is easier to reuse, maintain, and update the clients that have been created, but not for the reasons you are giving.

I think everything offered here is already satisfied.

Hey @WHYTEWYLL, from our past calls, we should also try to load the data locally, and as you mention we can wrap the boto3 functions in our own SDK that we deploy to pip, for example. That would make every client usable from one single object, e.g. client.discord and client.github, as sketched below.
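
A minimal sketch of that single-object idea (class and attribute names are placeholders, not an existing package):

    # Sketch of the single-object idea; class and method names are placeholders.
    import boto3

    class DiscordClient:
        def __init__(self, athena):
            self._athena = athena   # would run Athena queries for e.g. languages(), development()

    class GithubClient:
        def __init__(self, athena):
            self._athena = athena   # would run Athena queries for e.g. clones(), interactions()

    class PolywrapData:
        """Single entry point that wraps the boto3 clients."""
        def __init__(self):
            athena = boto3.client("athena")
            self.discord = DiscordClient(athena)
            self.github = GithubClient(athena)

    client = PolywrapData()
    # client.discord and client.github are now reachable from one object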

I agree the Jupyter notebooks would be beneficial, but there have been comments about the cost of maintaining this infrastructure. I thought Colabs were open and free to host as an alternative. Plus, they don't require setting up an Amazon IAM account to access the platform.

Does this make sense to you?