We Dumped GitHub into DuckLake, Here's What We Found
By Patrick DeVivo & Jake Zeller
At Powerset, we've spent the past few months building technology to ingest and analyze large amounts of public data. The first source we've examined is activity on GitHub, via the GitHub API and raw Git protocol. Our initial findings from the GitHub data:
- Growth is exploding: YoY growth rate in new repos doubled in 2025
- Coding agents took off in 2025; OpenCode is leading by stars
- Anthropic overtook OpenAI in developer interest, but it's a tight race
- Microsoft leads Big Tech in new repo creation, contrary to its historical reputation as a closed company
- Andreessen Horowitz has captured nearly as much open source value as all other early-stage VCs combined, driven by Databricks
While most firms that invest in startups keep their data confidential, we plan to open source and make publicly available as much of our work as possible. Details and data below.
GitHub Overview
As of this writing, we've counted ~430 million public repos, of which 417 million were created on or before Dec 31, 2025. From 2020 through 2024, repos grew 16% YoY on average. Growth accelerated sharply in 2025 to 34% YoY.
Public GitHub Repos by Creation Year
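For readers who want to reproduce the growth arithmetic, here is a minimal sketch. It assumes the YoY figures are computed on the number of repos created in each calendar year; the counts below are placeholders chosen only to illustrate the calculation, not figures from the dataset, and `yoy_growth` is a hypothetical helper name.

```python
def yoy_growth(counts_by_year: dict[int, int]) -> dict[int, float]:
    """Return year-over-year growth (as a fraction) for each year that has a prior year."""
    growth = {}
    for year in sorted(counts_by_year):
        prev = counts_by_year.get(year - 1)
        if prev:  # skip years whose prior-year count is missing or zero
            growth[year] = counts_by_year[year] / prev - 1
    return growth


# Placeholder counts of repos created per year (illustrative only).
new_repos = {2023: 100, 2024: 116, 2025: 155}
growth = yoy_growth(new_repos)
print({year: f"{rate:.0%}" for year, rate in growth.items()})
```

With these placeholder inputs, 2024 comes out at 16% and 2025 at 34%, which mirrors the shape of the acceleration described above.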
While there are many amazing projects on GitHub, there's also a huge amount of noise (e.g., throwaway personal projects, one-off test repos, homework assignments). We filter down the repo universe using two simple rules: (i) the repo must have at least one commit in the past 90 days, and (ii) the repo must have at least 10 stars. This takes us from ~430 million to ~350,000 repos, a roughly 1,200x reduction.[1]
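The two rules are simple enough to express in a few lines. The following is a sketch only: the `Repo` record and its field names are hypothetical, and the sample repos are made up for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Repo:
    # Hypothetical record shape; field names are illustrative.
    full_name: str
    stars: int
    last_commit_at: datetime


def is_active(repo: Repo, now: datetime, min_stars: int = 10, window_days: int = 90) -> bool:
    """Rule (i): a commit within the window. Rule (ii): a minimum star count."""
    recent_commit = now - repo.last_commit_at <= timedelta(days=window_days)
    return recent_commit and repo.stars >= min_stars


now = datetime(2026, 1, 15, tzinfo=timezone.utc)
repos = [
    Repo("example/active", stars=25, last_commit_at=now - timedelta(days=3)),
    Repo("example/stale", stars=500, last_commit_at=now - timedelta(days=400)),
    Repo("example/homework", stars=0, last_commit_at=now - timedelta(days=1)),
]
active = [r.full_name for r in repos if is_active(r, now)]
print(active)  # ['example/active']

# Reduction factor implied by the counts above: ~430 million down to ~350,000.
reduction = 430_000_000 / 350_000
print(round(reduction))  # 1229
```

Both thresholds are parameters here so that stricter or looser cuts can be tried without changing the function.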
Today we're releasing Repo Score, a public view into the health of these ~350,000 active repos, and Repo Stars, a way to compare star growth across them. We built these using Modal to run many of our workloads, DuckLake on Google Cloud to store data, Supabase to serve data, and OpenCode to help us ship faster with AI.
Coding Agents Take Off
In Feb 2025, Andrej Karpathy coined the term "vibe coding," marking the start of a new era of software development.
The four most popular CLI agents all launched in 2025: Claude Code, Codex, Gemini CLI, and OpenCode. While OpenCode was the last to launch, it ended the year with the most stars.
CLI Coding Tools: GitHub Stars in 2025[2]
Looking at community issues (opened by non-collaborators on the repo), we see Claude Code has the most new issue activity.
CLI Coding Tools: Daily New Community Issues
Weekly rolling average
CLI Coding Tools: Cumulative Community Issues
Includes open and closed issues
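One way to approximate "community issues" is the `author_association` field that the GitHub REST API returns on each issue. The sketch below assumes that issues from `OWNER`, `MEMBER`, and `COLLABORATOR` authors are excluded and everything else counts as community; that split, the helper names, and the sample issues are assumptions for illustration, not necessarily the method behind the charts.

```python
from collections import Counter
from datetime import date, timedelta

# Assumption: these associations mark repo insiders rather than the community.
INSIDER_ASSOCIATIONS = {"OWNER", "MEMBER", "COLLABORATOR"}


def community_issues_per_day(issues: list[dict]) -> Counter:
    """Count issues per creation date, skipping insiders and pull requests."""
    counts = Counter()
    for issue in issues:
        if "pull_request" in issue:  # the REST issues endpoint also returns PRs
            continue
        if issue["author_association"] in INSIDER_ASSOCIATIONS:
            continue
        counts[date.fromisoformat(issue["created_at"][:10])] += 1
    return counts


def rolling_average(counts: Counter, end: date, window_days: int = 7) -> float:
    """Average daily count over the window ending on `end` (inclusive)."""
    days = [end - timedelta(days=i) for i in range(window_days)]
    return sum(counts[d] for d in days) / window_days


# Made-up issues in the shape of the REST API response (only the fields used here).
issues = [
    {"created_at": "2025-03-01T12:00:00Z", "author_association": "NONE"},
    {"created_at": "2025-03-01T15:30:00Z", "author_association": "CONTRIBUTOR"},
    {"created_at": "2025-03-02T09:00:00Z", "author_association": "MEMBER"},
    {"created_at": "2025-03-03T09:00:00Z", "author_association": "NONE", "pull_request": {}},
]
daily = community_issues_per_day(issues)
print(daily[date(2025, 3, 1)])  # 2
print(rolling_average(daily, end=date(2025, 3, 7)))
```

Treating `CONTRIBUTOR` as community is a judgment call: those authors have committed to the repo before but do not have collaborator access.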
AI/ML: Outsized Attention
AI/ML repos continued to dominate attention in 2025, accounting for 7% of new projects but 22% of new stars. In contrast, every other category earned roughly proportional attention.
AI/ML: Share of Repos vs. Share of Stars by Year
Other Categories: Share of Repos vs. Share of Stars by Year
Average across all non-AI/ML categories
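A quick way to read the 7% and 22% figures is as a ratio of star share to repo share, where 1.0 means attention proportional to volume. This is a small worked example using those two figures; `attention_ratio` is a hypothetical helper name, and the non-AI/ML remainder assumes the categories together cover all repos and stars.

```python
def attention_ratio(share_of_repos: float, share_of_stars: float) -> float:
    """Star share divided by repo share; 1.0 means proportional attention."""
    if share_of_repos <= 0:
        raise ValueError("share_of_repos must be positive")
    return share_of_stars / share_of_repos


ai_ratio = attention_ratio(0.07, 0.22)
print(f"{ai_ratio:.1f}x")  # 3.1x

# Everything that is not AI/ML, assuming the categories are exhaustive.
rest_ratio = attention_ratio(1 - 0.07, 1 - 0.22)
print(f"{rest_ratio:.2f}x")  # 0.84x
```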
Prior to ChatGPT's release in Nov 2022, Frontend & Web Apps and Developer Tools dominated attention on GitHub.
Post-ChatGPT, AI/ML repos surged.
AI/ML Repos Stealing the Stars
Share of GitHub stars by category
Foundation Model Race
With AI repos taking off, we looked at trends related to foundation model providers. We searched repo names, descriptions, topics, and READMEs for keywords associated with each provider's models. While OpenAI led early on, Anthropic took the lead in Q4 2025.
Mentions in GitHub Repos
Count of repos mentioning each provider's models in name, description, topics, or README
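The keyword search can be sketched as a case-insensitive match over the four text fields. The keyword lists below are hypothetical (the exact terms used are not listed in this article), as are the field names and sample repos; a real run would need a broader list and more careful handling of ambiguous terms.

```python
import re
from collections import Counter

# Hypothetical keyword lists, for illustration only.
PROVIDER_KEYWORDS = {
    "OpenAI": ["openai", "chatgpt", "gpt"],
    "Anthropic": ["anthropic", "claude"],
}


def compile_patterns(keywords: dict[str, list[str]]) -> dict[str, re.Pattern]:
    """One pattern per provider; a keyword must not sit inside a longer alphanumeric token."""
    return {
        provider: re.compile(
            r"(?<![a-z0-9])(?:" + "|".join(re.escape(t) for t in terms) + r")(?![a-z0-9])",
            re.IGNORECASE,
        )
        for provider, terms in keywords.items()
    }


def providers_mentioned(repo: dict, patterns: dict[str, re.Pattern]) -> set[str]:
    """Search name, description, topics, and README text for each provider's keywords."""
    text = " ".join([
        repo.get("name") or "",
        repo.get("description") or "",
        " ".join(repo.get("topics") or []),
        repo.get("readme") or "",
    ])
    return {provider for provider, pattern in patterns.items() if pattern.search(text)}


patterns = compile_patterns(PROVIDER_KEYWORDS)
repos = [
    {"name": "my_claude_tool", "description": "CLI helper", "topics": [], "readme": ""},
    {"name": "chat-ui", "description": "Frontend for GPT-4o and Claude", "topics": ["openai"]},
    {"name": "egptian-history", "description": None, "topics": [], "readme": "Notes"},
]
counts = Counter(p for repo in repos for p in providers_mentioned(repo, patterns))
print(dict(counts))
```

Each repo is counted at most once per provider, and a repo that mentions several providers counts toward each of them. The lookarounds keep a short keyword such as "gpt" from matching inside unrelated words.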
Big Tech in Open Source
In 2025, major tech companies shipped a lot of new public repos, and a significant share of them are AI-related. We analyzed the repos created by ten major tech companies (selected by market cap and open source presence) and classified each as AI-related or not based on keywords in the repo name, description, topics, and READMEs. We then looked at the stars on these repos to understand where developer attention landed.
Big Tech New Repos
Public GitHub repos created in 2025 by major tech companies
Big Tech Stars
GitHub stars on repos created in 2025 by major tech companies
Top Repos by Company
Top 10 repos from each company's 2025 open source output
Commercial OSS Investors
To help founders identify top investors in commercial open source software (COSS) companies, we compiled a set of 50 unicorns ($1B+ valuation) and analyzed which firms led their Seed and Series A rounds.
Stage labels for investment rounds have shifted over time (e.g., the creation of the pre-seed stage and the emergence of $100+ million AI seeds), so any analysis is inherently subjective. The data feeding our charts is available here.
Benchmark and Lightspeed appear most often as lead investors in the Seed and Series A rounds of COSS unicorns, with Andreessen Horowitz and others close behind.
Top Investors by Count
Times listed as a lead investor in Seed/A rounds of COSS unicorns
Weighted by valuation, the distribution resembles a power law. Andreessen Horowitz dominates total value captured, driven by its early bet on Databricks.
Top Investors by Valuation Captured
Sum of current valuations across COSS unicorns where the firm led a Seed/A round
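The two rankings differ only in how they aggregate the same round-level data. The sketch below uses placeholder firms, companies, and valuations (not real deals), and assumes that each lead firm is credited with a company's full current valuation once, even if it led both rounds.

```python
from collections import Counter, defaultdict

# Placeholder rows: (company, round, lead investor). Not real deals.
rounds = [
    ("Company A", "Seed", "Firm X"),
    ("Company A", "Series A", "Firm X"),
    ("Company B", "Series A", "Firm Y"),
    ("Company C", "Seed", "Firm Y"),
    ("Company C", "Series A", "Firm Z"),
]
# Placeholder current valuations in $ billions.
valuations = {"Company A": 50.0, "Company B": 2.0, "Company C": 1.5}

# Ranking 1: times listed as a lead investor in a Seed/A round.
lead_counts = Counter(firm for _company, _round, firm in rounds)

# Ranking 2: sum of valuations across distinct companies where the firm led a round.
companies_by_firm = defaultdict(set)
for company, _round, firm in rounds:
    companies_by_firm[firm].add(company)
value_captured = {
    firm: sum(valuations[c] for c in companies)
    for firm, companies in companies_by_firm.items()
}

print(lead_counts.most_common())
print(sorted(value_captured.items(), key=lambda kv: kv[1], reverse=True))
```

In this toy data, Firm X and Firm Y tie on count, but a single large outcome puts Firm X far ahead on value, which is the same effect a power-law distribution produces in the real rankings.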
Future Plans
In the coming months, we'll be diving into additional data sources and research topics. What we're focused on next:
- More free tools like Repo Score and Repo Stars
- Predictive models for early-stage startup outcomes
- Analyzing open source contribution graphs
- Analyzing technical trends from new data sources (e.g., careers pages, data processor graphs)
While firms often keep this kind of work private (and vendors gate it behind paywalls), we're glad to share it with the community. Subscribe below for updates.
[1] All charts in this article are based on the dataset of ~350,000 active repos.
[2] Repos show 0 stars prior to their launch date. Star counts above 40k are not available via the GitHub API and are interpolated linearly from the latest count.
Stay in the loop
Get notified when we publish new articles and research