
We Dumped GitHub into DuckLake, Here's What We Found

By Patrick DeVivo & Jake Zeller


At Powerset, we've spent the past few months building technology to ingest and analyze large amounts of public data. The first source we've examined is activity on GitHub, via the GitHub API and raw Git protocol. Our initial findings from the GitHub data:

  • Growth is exploding: YoY growth in new repos roughly doubled in 2025, to 34% from a 16% average over 2020 through 2024
  • Coding agents took off in 2025; OpenCode is leading by stars
  • Anthropic overtook OpenAI in developer interest, but it's a tight race
  • Microsoft leads Big Tech in new repo creation, contrary to its historical reputation as a closed-source company
  • Andreessen Horowitz has captured nearly as much open source value as all other early-stage VCs combined, driven by Databricks

While most firms that invest in startups keep their data confidential, we plan to open source and make publicly available as much of our work as possible. Details and data below.

GitHub Overview

As of this writing, we've counted ~430 million public repos, of which 417 million were created on or before Dec 31, 2025. From 2020 through 2024, repos grew 16% YoY on average. Growth accelerated sharply in 2025 to 34% YoY.
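For readers who want to reproduce a growth figure like the ones above, here is a minimal sketch of the year-over-year calculation. It assumes you already have a count per year for whichever series you are measuring; the function name and the sample numbers are illustrative and are not taken from our dataset.

```python
def yoy_growth(counts_by_year):
    """Return {year: fractional growth vs. the prior year}.

    Years without a prior-year count (or with a prior count of 0) are skipped.
    """
    growth = {}
    for year in sorted(counts_by_year):
        prev = counts_by_year.get(year - 1)
        if prev:
            growth[year] = counts_by_year[year] / prev - 1
    return growth


# Illustrative counts only, chosen to make the arithmetic easy to follow.
sample = {2023: 100, 2024: 116, 2025: 155}
for year, rate in yoy_growth(sample).items():
    print(year, f"{rate:.0%}")
```

An average growth rate over several years can then be taken over the values this returns.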

Public GitHub Repos by Creation Year

While there are many amazing projects on GitHub, there's also a huge amount of noise (e.g., throwaway personal projects, one-off test repos, homework assignments). We filter down the repo universe using two simple rules: (i) the repo must have at least one commit in the past 90 days, and (ii) the repo must have at least 10 stars. This takes us from ~430 million to ~350,000 repos, roughly a 1,200x reduction.[1]
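The two rules are simple enough to express in a few lines. The sketch below is one way to write them, assuming each repo is available as a record with a star count and the timestamp of its latest commit. The `Repo` class and its field names are hypothetical, not our actual schema.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Repo:
    # Hypothetical record shape; field names are assumptions for illustration.
    full_name: str
    stars: int
    last_commit_at: datetime


def is_active(repo, now, min_stars=10, window_days=90):
    """Rule (i): a commit within the window. Rule (ii): enough stars."""
    recent = now - repo.last_commit_at <= timedelta(days=window_days)
    return recent and repo.stars >= min_stars


def filter_active(repos, now):
    return [r for r in repos if is_active(r, now)]


now = datetime(2026, 1, 15, tzinfo=timezone.utc)
repos = [
    Repo("example/fresh", 42, now - timedelta(days=3)),
    Repo("example/stale", 900, now - timedelta(days=400)),
    Repo("example/tiny", 2, now - timedelta(days=1)),
]
print([r.full_name for r in filter_active(repos, now)])
```

Passing `now` explicitly keeps the filter reproducible: rerunning it against the same snapshot date gives the same set of repos.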

Today we're releasing Repo Score, a public view into the health of these ~350,000 active repos, and Repo Stars, a way to compare star growth across them. We built this using Modal for many workloads, DuckLake on Google Cloud to store data, Supabase to serve data, and OpenCode to help us ship faster with AI.

Coding Agents Take Off

In Feb 2025, Andrej Karpathy coined the term "vibe coding," marking the start of a new era of software development.

The four most popular CLI agents all launched in 2025: Claude Code, Codex, Gemini CLI, and OpenCode. While OpenCode was the last to launch, it ended the year with the most stars.

CLI Coding Tools: GitHub Stars in 2025[2]

Looking at community issues (opened by non-collaborators on the repo), we see Claude Code has the most new issue activity.

CLI Coding Tools: Daily New Community Issues

Weekly rolling average

CLI Coding Tools: Cumulative Community Issues

Includes open and closed issues
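A rough sketch of how the community issue series above could be built. It assumes issues come from the GitHub REST API, where each issue carries an `author_association` field, and it assumes that `OWNER`, `MEMBER`, and `COLLABORATOR` are the values to exclude. That mapping is an illustrative choice, and the function names are hypothetical.

```python
from collections import Counter
from datetime import date, timedelta

# Assumption: these author_association values mark repo insiders.
INSIDER_ASSOCIATIONS = {"OWNER", "MEMBER", "COLLABORATOR"}


def daily_community_issues(issues):
    """Count issues per creation date, skipping those opened by insiders.

    `issues` is an iterable of (created_on: date, author_association: str).
    """
    return Counter(day for day, assoc in issues if assoc not in INSIDER_ASSOCIATIONS)


def rolling_average(daily, start, end, window=7):
    """Trailing `window`-day mean for each day in [start, end].

    Days with no recorded issues count as 0.
    """
    out = {}
    day = start
    while day <= end:
        total = sum(daily.get(day - timedelta(days=i), 0) for i in range(window))
        out[day] = total / window
        day += timedelta(days=1)
    return out


issues = [
    (date(2025, 6, 1), "NONE"),
    (date(2025, 6, 1), "OWNER"),
    (date(2025, 6, 2), "CONTRIBUTOR"),
    (date(2025, 6, 2), "NONE"),
]
daily = daily_community_issues(issues)
weekly = rolling_average(daily, date(2025, 6, 1), date(2025, 6, 7))
```

The cumulative series is a running sum of the same daily counts.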

AI/ML: Outsized Attention

AI repos continued to dominate attention in 2025, representing 7% of new projects but 22% of new stars. In contrast, every other category earned roughly proportional attention.
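To make "proportional attention" concrete, the sketch below computes each category's share of repos, its share of stars, and the ratio between the two. A ratio near 1 means attention roughly matches volume. The function name and input shape are assumptions, and the sample rows are illustrative values chosen only to mirror the 7% and 22% figures.

```python
def category_shares(rows):
    """rows: iterable of (category, repo_count, star_count).

    Returns {category: (share_of_repos, share_of_stars, attention_ratio)}.
    Assumes every category has at least one repo.
    """
    rows = list(rows)
    total_repos = sum(repos for _, repos, _ in rows)
    total_stars = sum(stars for _, _, stars in rows)
    shares = {}
    for category, repos, stars in rows:
        repo_share = repos / total_repos
        star_share = stars / total_stars
        shares[category] = (repo_share, star_share, star_share / repo_share)
    return shares


# Illustrative counts only: 7 of 100 repos, 22 of 100 stars.
shares = category_shares([("AI/ML", 7, 22), ("Everything else", 93, 78)])
for category, (repo_share, star_share, ratio) in shares.items():
    print(f"{category}: {repo_share:.0%} of repos, {star_share:.0%} of stars, {ratio:.1f}x")
```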

AI/ML: Share of Repos vs. Share of Stars by Year

Other Categories: Share of Repos vs. Share of Stars by Year

Average across all non-AI/ML categories

Prior to ChatGPT's release in Nov 2022, Frontend & Web Apps and Developer Tools dominated attention on GitHub. Post-ChatGPT, AI/ML repos surged.

AI/ML Repos Stealing the Stars

Share of GitHub stars by category

Foundation Model Race

With AI repos taking off, we looked at trends related to foundation model providers. We searched repo names, descriptions, topics, and READMEs for keywords associated with each provider's models. While OpenAI led early on, Anthropic took the lead in Q4 2025.

Mentions in GitHub Repos

Count of repos mentioning each provider's models in name, description, topics, or README
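The keyword search can be sketched as follows. The keyword lists here are hypothetical placeholders, not the ones behind the chart, and the repo dictionary shape is an assumption. Each repo counts at most once per provider, no matter how many times a keyword appears.

```python
import re

# Hypothetical keyword lists for illustration only.
PROVIDER_KEYWORDS = {
    "OpenAI": ["openai", "gpt-4", "chatgpt"],
    "Anthropic": ["anthropic", "claude"],
}


def compile_patterns(keywords_by_provider):
    """Build one case-insensitive regex per provider.

    Lookarounds keep a keyword from matching inside a longer word.
    """
    return {
        provider: re.compile(
            r"(?<!\w)(?:" + "|".join(map(re.escape, words)) + r")(?!\w)",
            re.IGNORECASE,
        )
        for provider, words in keywords_by_provider.items()
    }


def providers_mentioned(repo, patterns):
    """repo: dict with optional 'name', 'description', 'topics' (list), 'readme'."""
    text = " ".join([
        repo.get("name") or "",
        repo.get("description") or "",
        " ".join(repo.get("topics") or []),
        repo.get("readme") or "",
    ])
    return {provider for provider, rx in patterns.items() if rx.search(text)}


def count_mentions(repos, patterns):
    counts = {provider: 0 for provider in patterns}
    for repo in repos:
        for provider in providers_mentioned(repo, patterns):
            counts[provider] += 1
    return counts
```

Word-bounded matching is a design choice: it avoids false hits on substrings, at the cost of missing variants that are not listed explicitly.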

Big Tech in Open Source

In 2025, major tech companies shipped a lot of new public repos, and a significant share of them are AI-related. We analyzed the repos created by ten major tech companies (selected by market cap and open source presence) and classified each as AI-related or not based on keywords in the repo name, description, topics, and READMEs. We then looked at the stars on these repos to understand where developer attention landed.

Big Tech New Repos

Public GitHub repos created in 2025 by major tech companies

Big Tech Stars

GitHub stars on repos created in 2025 by major tech companies

Top Repos by Company

Top 10 repos from each company's 2025 open source output

Commercial OSS Investors

To help founders identify top investors in commercial open source software (COSS) companies, we compiled a set of 50 unicorns ($1B+ valuation) and analyzed which firms led their Seed and Series A rounds.

Stage labels for investment rounds have shifted over time (e.g., creation of the pre-seed stage, emergence of $100+ million AI seeds, etc.), so any analysis is inherently subjective. The data feeding our charts is available here.

Benchmark and Lightspeed have the highest frequency of successful COSS investments, with Andreessen Horowitz and others close behind.

Top Investors by Count

Times listed as a lead investor in Seed/A rounds of COSS unicorns

Weighted by valuation, the distribution resembles a power law. Andreessen Horowitz dominates total value captured, driven by its early bet on Databricks.

Top Investors by Valuation Captured

Sum of current valuations across COSS unicorns where the firm led a Seed/A round
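Both rankings can be derived from the same list of led rounds. The sketch below assumes one row per (company, round, lead firm) and a lookup of current valuations. It credits a firm with a company's full valuation once, even if the firm led both the Seed and the Series A. That crediting rule, the function name, and the placeholder firms and companies are all assumptions for illustration.

```python
from collections import Counter, defaultdict


def rank_investors(led_rounds, valuations):
    """led_rounds: iterable of (company, round_name, firm).
    valuations: {company: current valuation in USD}.

    Returns (times_led, value_captured), both keyed by firm.
    """
    led_rounds = list(led_rounds)
    # Count every round led, so leading Seed and Series A counts twice.
    times_led = Counter(firm for _, _, firm in led_rounds)
    # Track distinct companies per firm so valuation is credited once.
    companies_by_firm = defaultdict(set)
    for company, _, firm in led_rounds:
        companies_by_firm[firm].add(company)
    value_captured = {
        firm: sum(valuations[c] for c in companies)
        for firm, companies in companies_by_firm.items()
    }
    return times_led, value_captured


# Placeholder names and numbers, not real deals.
rounds = [
    ("AcmeDB", "Seed", "Firm A"),
    ("AcmeDB", "Series A", "Firm A"),
    ("AcmeDB", "Series A", "Firm B"),
    ("Widgetly", "Seed", "Firm B"),
]
valuations = {"AcmeDB": 10_000_000_000, "Widgetly": 1_000_000_000}
times_led, value_captured = rank_investors(rounds, valuations)
```

Because a single large outcome is credited in full to each lead, one company can dominate a firm's total, which is what produces the power-law shape.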

Future Plans

In the coming months, we'll be diving into additional data sources and research topics. What we're focused on next:

  • More free tools like Repo Score and Repo Stars
  • Predictive models for early-stage startup outcomes
  • Analyzing open source contribution graphs
  • Analyzing technical trends from new data sources (e.g., careers pages, data processor graphs)

While firms often keep this kind of work private (and vendors gate it behind paywalls), we're glad to share it with the community. Subscribe below for updates.


[1] All charts in this article are based on the dataset of ~350,000 active repos.

[2] Repos show 0 stars prior to their launch date. Star counts above 40k are not available via the GitHub API and are interpolated linearly from the latest count.
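One possible reading of the linear interpolation in footnote [2], sketched below: draw a straight line between the last star with a known timestamp and the latest total star count, then read values off that line. This is an assumption about the method, and the function and parameter names are hypothetical.

```python
from datetime import datetime


def interpolate_stars(t, t_cap, stars_cap, t_latest, stars_latest):
    """Linearly interpolate the star count at time t between two known points.

    (t_cap, stars_cap): last point with a known timestamp.
    (t_latest, stars_latest): the most recent total star count.
    """
    if not t_cap <= t <= t_latest:
        raise ValueError("t must lie between the two known points")
    span = (t_latest - t_cap).total_seconds()
    if span == 0:
        return float(stars_latest)
    frac = (t - t_cap).total_seconds() / span
    return stars_cap + frac * (stars_latest - stars_cap)


# Illustrative values: halfway between the two points in time.
estimate = interpolate_stars(
    datetime(2025, 1, 6),
    datetime(2025, 1, 1), 40_000,
    datetime(2025, 1, 11), 60_000,
)
```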

Stay in the loop

Get notified when we publish new articles and research