
Recruit Cracked Engineers Using Public GitHub Data

By Patrick DeVivo & Jake Zeller


We've been using public GitHub data internally at Powerset to help our portfolio founders identify and recruit top open source developers. Today, we're making the underlying data layer freely available for your agents. Check out the repo to get started.

Sometimes a video is worth 1,000 words:

[Video: Asking Claude Code to find NYC engineers who've worked on terminal coding agents, using the Powerset GitHub dataset over MCP.]

What's in the Data

The data covers ~400,000 active GitHub repos and is queryable in two ways:

  1. MCP, if you want Claude, Codex, Cursor, or another MCP-compatible client to ask questions conversationally. It is also available as a ChatGPT App.
  2. DuckDB, if you want to attach directly to our DuckLake instance and run SQL yourself.

Repos, contributors, activity, stars, languages, and project metadata are all available in a form your agents can use. No credentials required.

Example questions you can ask:

  • Find the 5 most impressive systems architects in San Francisco
  • Who are the best fits for this role? [insert link to engineering job description]
  • What are the fastest-growing terminal coding agents?

How it Works

We run a daily Modal cron to publish the data as a frozen DuckLake instance backed by Parquet files on Cloudflare R2. You can query it through our hosted MCP endpoint or attach to it directly from DuckDB.

MCP Setup

# Claude Code (its `http` transport is the streamable HTTP transport)
claude mcp add --transport http powerset-research https://research-mcp.powerset.dev/mcp/

# OpenAI Codex
codex mcp add powerset-research --url https://research-mcp.powerset.dev/mcp/
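
To check that the hosted endpoint is reachable before wiring it into a client, you could send it an MCP `initialize` request by hand. The following is a minimal sketch using only the Python standard library. It assumes the endpoint speaks MCP's Streamable HTTP transport and accepts the `2025-03-26` protocol revision; the client name `smoke-test` and the helper name are made up for illustration.

```python
import json
import urllib.request

# URL taken from the setup commands above.
MCP_URL = "https://research-mcp.powerset.dev/mcp/"


def build_initialize_request(url: str = MCP_URL) -> urllib.request.Request:
    """Build (but do not send) a JSON-RPC `initialize` request for an MCP server."""
    payload = {
        "jsonrpc": "2.0",
        "id": 1,
        "method": "initialize",
        "params": {
            # Assumption: the server accepts this protocol revision.
            "protocolVersion": "2025-03-26",
            "capabilities": {},
            # Hypothetical client identity, for illustration only.
            "clientInfo": {"name": "smoke-test", "version": "0.0.1"},
        },
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Streamable HTTP servers may reply with plain JSON or an SSE stream.
            "Accept": "application/json, text/event-stream",
        },
        method="POST",
    )
```

Passing the result to `urllib.request.urlopen` sends it. A healthy server should answer with a JSON-RPC result describing itself, though the exact body depends on the server.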

DuckDB Setup

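-- Attach the published DuckLake catalog read-only (uses DuckDB's ducklake extension).
ATTACH 'ducklake:https://research-data.powerset.dev/github-public/latest/public.ducklake' AS github (READ_ONLY);

-- The 20 most-starred repos, with the time of their last push.
SELECT name_with_owner, stars_count, pushed_at
FROM github.repos
ORDER BY stars_count DESC
LIMIT 20;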
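
The same attach-and-query flow can be driven from a script, which is closer to how an agent would use it. Here is a minimal sketch with the `duckdb` Python package, assumed to be installed (`pip install duckdb`) in a version that supports the DuckLake extension. The helper names are illustrative and not part of any Powerset tooling.

```python
CATALOG_URL = (
    "ducklake:https://research-data.powerset.dev/"
    "github-public/latest/public.ducklake"
)


def attach_sql(catalog_url: str = CATALOG_URL, alias: str = "github") -> str:
    """Return the read-only ATTACH statement from the DuckDB setup."""
    return f"ATTACH '{catalog_url}' AS {alias} (READ_ONLY);"


def top_repos_sql(limit: int = 20, alias: str = "github") -> str:
    """Return the most-starred-repos query, using only columns shown above."""
    return (
        "SELECT name_with_owner, stars_count, pushed_at "
        f"FROM {alias}.repos "
        "ORDER BY stars_count DESC "
        f"LIMIT {int(limit)};"
    )


def fetch_top_repos(limit: int = 20):
    """Attach the public catalog and return the top repos as a list of rows."""
    import duckdb  # third-party; imported lazily so the SQL helpers work without it

    con = duckdb.connect()  # in-memory database; the lake itself stays remote
    try:
        # Explicit install/load; recent DuckDB versions may do this automatically.
        con.execute("INSTALL ducklake")
        con.execute("LOAD ducklake")
        con.execute(attach_sql())
        return con.execute(top_repos_sql(limit)).fetchall()
    finally:
        con.close()
```

Calling `fetch_top_repos(5)` needs network access, and the rows it returns depend on the current daily snapshot.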

For agents querying directly through DuckDB, we also provide a skill file with schema context, query patterns, and examples. Full setup instructions and documentation are available in the research-data repo.
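
Without the skill file, an agent can still discover the schema at runtime. The sketch below assumes `con` is an open `duckdb` connection with the catalog already attached under the alias `github`, as in the DuckDB setup. It reads DuckDB's standard `information_schema.columns` view; the function names are illustrative.

```python
from collections import defaultdict

# Lists every column of every table in one attached catalog.
SCHEMA_SQL = """
SELECT table_name, column_name, data_type
FROM information_schema.columns
WHERE table_catalog = ?
ORDER BY table_name, ordinal_position
"""


def group_columns(rows):
    """Group (table, column, type) rows into {table: [(column, type), ...]}."""
    tables = defaultdict(list)
    for table, column, data_type in rows:
        tables[table].append((column, data_type))
    return dict(tables)


def describe_catalog(con, catalog="github"):
    """Return the tables and columns of an attached catalog."""
    rows = con.execute(SCHEMA_SQL, [catalog]).fetchall()
    return group_columns(rows)
```

The resulting dictionary can be dropped into an agent's prompt as lightweight schema context before it writes SQL.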
