I joined Collab and began working on adding a second data source to the recently acquired TrendPop’s platform. As a somewhat of a content creator myself, it’s been really exciting to have worked so closely to the creator economy.
My internship consisted of 3 main parts: creating an engineering proposal plan, implementation, and finally a presentation open to the entire company which included some execs.
The goal of creating this proposal was to align the entire team with what I would be doing over the summer and the tools and methodologies I would use to accomplish it. This part of my internship included
The most important thing to research was how to best extract data from YouTube. I settled on using a technique that I call forging API requests, in which you make requests that look identical to what a legitimate client would make to the backend server. Since most websites do use the AJAX approach, this is pretty effective on most websites. This approach has significant tradeoffs compared to a traditional HTML based web scraping approach, if you want to learn more about this approach check out lesson 1 in my everything-web-scraping series. The largest one is the lack of control over changes that the 3rd party makes to their API, I used commit history on
youtube-dl to see how frequently the API changed and it seems to be pretty rare where it was acceptable to use this method.
While investigating the best tools for the job, it was decided on that I should look into Apache Spark. It fit our use case perfectly. It allows us to easily scale our jobs across multiple threads and in the future if needed across multiple computers/executors if extremely computationally expensive.
One important thing was to ensure that there was good visibility onto what these jobs were doing as it’s always a challenge to maintain and debug these kinds of programs that are so dependent on third party API responses with countless edge cases. To increase the ease of debugging on all the jobs, I reported metrics around failing API requests, parsing failures, and any postgres errors to telegraf which communicates with an influxdb instance that Grafana pulls data from. Here’s a screenshot of one of the dashboards within Grafana I created.
The last thing I’ll talk about for this section was the pain of slowly working through dozens of edge cases since the “hidden” YouTube API is not officially documented and returns a lot of different types of structures like
videoRenderer which is awfully annoying to deal with.
Unfortunately, I can’t share the exact slides here. This presentation dove into case studies on how this new YouTube data could help a new potential customer. Next, was how the data could help Collab creator Zhong better understand their audience and how to further optimize their content strategy.
To answer all these questions, I spent this entire week relying heavily on my data science skills writing complex SQL queries and running more data-intensive code with python making heavy use of the pandas package to deliver some interesting insights into YouTube.
Overall, I had a great time and enjoyed working so closely with the creator economy.Back to Home