feat: repository analytics & repo populated & repo health score & health score refactor (IN-1054) #3987

Open
gaspergrom wants to merge 2 commits into main from feat/repository-analytics

gaspergrom commented Mar 31, 2026

Summary

  • Add repository-level health score analytics with 12 separate metric pipes, a copy pipe aggregator, and supporting datasources
  • Refactor project-level health score pipes to use shared includes, eliminating benchmark logic duplication across 11 pipes
  • Extend project insights to support both project and repository records via a unified project_insights_copy_ds datasource

Details

Shared includes (includes/)

Introduced 11 .incl files containing parameterized benchmark/processing logic ($GROUP_COL, $SOURCE_NODE). These are used by both project-level and
repo-level health score pipes, ensuring benchmark thresholds (e.g., stars 0-9 → 0, 1000+ → 5) are defined in a single place.

  • Simple (benchmark only): active contributors, stars, forks, pull requests, active days, contributions outside work hours, issues resolution, merge lead
    time
  • Complex (multi-node processing + benchmark): contributor dependency (4 nodes), organization dependency (4 nodes), retention (2 nodes)
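
To make the idea of a single shared benchmark definition concrete, here is a minimal Python sketch of a parameterized 0-5 bucket lookup. It is not the Tinybird include itself. Only the two end points (0-9 → 0 and 1000+ → 5) come from the description above; the intermediate cut-offs and the names `HYPOTHETICAL_STAR_BOUNDS` and `benchmark` are assumptions made for illustration.

```python
from bisect import bisect_right

# Lower bounds of benchmark buckets 1..5. Only "0-9 -> 0" and "1000+ -> 5"
# are stated in the PR description; the middle values are hypothetical.
HYPOTHETICAL_STAR_BOUNDS = [10, 50, 100, 500, 1000]


def benchmark(value, bounds):
    """Return the 0-5 benchmark: how many lower bounds `value` has reached."""
    return bisect_right(bounds, value)


if __name__ == "__main__":
    for stars in (0, 9, 10, 999, 1000, 25000):
        print(stars, benchmark(stars, HYPOTHETICAL_STAR_BOUNDS))
```

Because the thresholds live in one list, a project-level and a repo-level caller would score the same value identically, which is the property the shared includes are meant to guarantee.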

Repository health score pipes (repo_health_score_*.pipe)

12 new pipes, each supporting two modes:

  • API mode (repoUrl param): single-repo query with optional startDate/endDate
  • Batch mode (no params): all repos, consumed by the copy pipe

Metrics: active contributors, contributor dependency, organization dependency, retention, stars, forks, issue resolution, pull requests, merge lead time,
active days, contributions outside work hours, security.

Repository health score copy pipe & datasource

  • repo_health_score_copy.pipe — joins all 12 separate pipes, computes 4 category percentages (contributor, popularity, development, security) + overall
    score. Runs daily at 01:50 UTC.
  • repo_health_score_copy_ds.datasource — stores all raw values, benchmarks (0-5), category percentages, and overall score per repository.
  • Search volume is excluded from repo-level scoring (popularity = stars + forks only).
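
The rollup can be pictured with a small Python sketch. The exact formulas in repo_health_score_copy.pipe are not shown in this description, so the two below are assumptions: a category percentage is taken as the share of the maximum attainable benchmark points, and the overall score as the mean of the categories that have data. The function names are hypothetical.

```python
MAX_BENCHMARK = 5  # benchmarks are stored on a 0-5 scale


def category_percentage(benchmarks):
    """Assumed formula: points earned / points attainable, as 0-100."""
    if not benchmarks:
        return None  # no metrics available for this category
    return round(100 * sum(benchmarks) / (MAX_BENCHMARK * len(benchmarks)))


def overall_score(category_percentages):
    """Assumed formula: mean of the categories that actually have a value."""
    present = [p for p in category_percentages if p is not None]
    if not present:
        return None
    return round(sum(present) / len(present))


if __name__ == "__main__":
    # Repo-level popularity uses stars and forks only (no search volume).
    popularity = category_percentage([4, 2])
    print(popularity, overall_score([popularity, None, 80]))
```

Returning `None` for an empty category keeps "no data" distinct from a genuine score of zero, so the missing category does not drag the overall score down.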

Repository populated data

  • repositories_populated_ds.datasource + repositories_populated_copy.pipe — enriched repo data (contributor/org counts, software value, first commit).
    Runs hourly.

Project insights extension

  • project_insights_copy_ds.datasource — added type (project/repo) and repoUrl columns
  • project_insights_copy.pipe — extended with repo results (UNION ALL with project results). Repo health scores are stubbed to NULL for now.
  • project_insights.pipe — added type = 'project' filter
  • project_repo_insights.pipe — new API endpoint serving both project and repo insights with filtering by ids and repoUrls

Project health score refactor

Refactored 11 existing project-level health_score_*.pipe files to replace inline benchmark nodes with INCLUDE directives. No functional change — same
columns, same endpoint behavior. health_score_copy.pipe unchanged.


Note

Medium Risk
Adds new Tinybird datasources and scheduled copy pipes that materialize repository-level metrics and union them into existing insights tables, which can affect analytics outputs and query performance if joins/filters are incomplete. Existing project health score pipes are refactored via shared includes, so functional risk is low there but rollout touches many production queries.

Overview
Adds repository-level analytics in Tinybird: a new repo_health_score_copy_ds plus metric pipes and a daily copy/aggregation pipeline to compute per-repo benchmarks, category percentages, and an overall score.

Extends project_insights_copy_ds to store a union of project and repo insight records (type, repoUrl), updates project_insights.pipe to filter type = 'project', and introduces project_repo_insights.pipe to query both with combined filtering.

Refactors multiple project health_score_*.pipe definitions to reuse new parameterized benchmark includes (includes/health_score_*.incl), reducing duplicated CASE logic without intended behavior changes.

Written by Cursor Bugbot for commit e72807b. This will update automatically on new commits.

…lth score refactor

Signed-off-by: Gašper Grom <gasper.grom@gmail.com>
Copilot AI review requested due to automatic review settings March 31, 2026 21:24
@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

sum(failedAssessments) AS failed,
sum(passedAssessments) AS passed,
sum(failedAssessments + passedAssessments) AS total,
round(100 * (passed / total)) AS percentage

Security score can divide by zero

Medium Severity

repo_health_score_security_category computes percentage as round(100 * (passed / total)) without guarding total = 0. When a repo/category has no remaining assessments after filtering, total can be zero and this expression can fail or yield invalid results, breaking repository security scoring.
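
The guard being asked for can be sketched in Python as follows. This models the arithmetic only, not the pipe; the function name is hypothetical, and returning `None` for the empty case is one possible choice (returning 0 would be another, with different scoring consequences).

```python
def security_percentage(passed, failed):
    """Percentage of passed assessments, or None when there is nothing to score."""
    total = passed + failed
    if total == 0:
        return None  # avoid dividing by zero when every assessment was filtered out
    return round(100 * passed / total)


if __name__ == "__main__":
    print(security_percentage(3, 1))
    print(security_percentage(0, 0))
```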

AND repo != ''
{% if defined(repoUrl) %}
AND repo = {{ String(repoUrl, description="Repository URL", required=False) }}
{% end %}

Excluded repos bypass security endpoint filtering

Low Severity

repo_health_score_security.pipe does not apply the repos_to_channels_excluded check used by other repository health pipes. Queries by repoUrl can return security scores for repositories intentionally excluded from analytics, producing inconsistent behavior across repository health endpoints.
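
As a rough model of the missing check, the Python sketch below drops security rows whose repository appears in an exclusion list, mirroring the `NOT IN (SELECT channel FROM repos_to_channels_excluded)` filter used by the other pipes. The row shape and the function name are assumptions.

```python
def visible_security_scores(scores, excluded_repos):
    """Keep rows with a non-empty repo that is not on the exclusion list."""
    excluded = set(excluded_repos)
    return [row for row in scores if row["repo"] and row["repo"] not in excluded]


if __name__ == "__main__":
    rows = [
        {"repo": "https://example.org/a", "percentage": 80},
        {"repo": "https://example.org/b", "percentage": 40},
        {"repo": "", "percentage": 10},
    ]
    print(visible_security_scores(rows, ["https://example.org/b"]))
```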

Copilot AI left a comment

Pull request overview

This PR adds repository-level analytics (populated repo metadata + repo health score metrics and rollup) and extends the existing project insights/health-score pipeline to support both project and repository records, while refactoring project health-score benchmark logic into shared Tinybird includes.

Changes:

  • Introduces a repository “populated” enrichment copy pipe + datasource and a daily repository health score copy pipe + datasource.
  • Adds repository health score metric pipes (dual API/batch modes) and a copy/rollup pipe that computes category percentages and overall score.
  • Extends project_insights_copy_ds to include both project and repo records, adds a new combined insights endpoint, and refactors project health score benchmark logic into shared includes/*.incl.

Reviewed changes

Copilot reviewed 42 out of 42 changed files in this pull request and generated 15 comments.

| File | Description |
| --- | --- |
| services/libs/tinybird/pipes/repositories_populated_copy.pipe | New copy pipe to materialize enriched repository metadata into repositories_populated_ds. |
| services/libs/tinybird/pipes/repo_health_score_stars.pipe | New repo-level stars metric pipe + shared benchmark include. |
| services/libs/tinybird/pipes/repo_health_score_security.pipe | New repo-level security metric pipe. |
| services/libs/tinybird/pipes/repo_health_score_retention.pipe | New repo-level retention metric pipe + shared include. |
| services/libs/tinybird/pipes/repo_health_score_pull_requests.pipe | New repo-level PR metric pipe + shared include. |
| services/libs/tinybird/pipes/repo_health_score_organization_dependency.pipe | New repo-level org dependency metric pipe + shared include. |
| services/libs/tinybird/pipes/repo_health_score_merge_lead_time.pipe | New repo-level merge lead time metric pipe + shared include. |
| services/libs/tinybird/pipes/repo_health_score_issues_resolution.pipe | New repo-level issue resolution metric pipe + shared include. |
| services/libs/tinybird/pipes/repo_health_score_forks.pipe | New repo-level forks metric pipe + shared include. |
| services/libs/tinybird/pipes/repo_health_score_copy.pipe | New daily copy pipe to join repo metric pipes and compute category/overall scores into repo_health_score_copy_ds. |
| services/libs/tinybird/pipes/repo_health_score_contributor_dependency.pipe | New repo-level contributor dependency metric pipe + shared include. |
| services/libs/tinybird/pipes/repo_health_score_contributions_outside_work_hours.pipe | New repo-level outside-work-hours metric pipe + shared include. |
| services/libs/tinybird/pipes/repo_health_score_active_days.pipe | New repo-level active-days metric pipe + shared include. |
| services/libs/tinybird/pipes/repo_health_score_active_contributors.pipe | New repo-level active-contributors metric pipe + shared include. |
| services/libs/tinybird/pipes/project_repo_insights.pipe | New endpoint serving combined project + repo insights from project_insights_copy_ds. |
| services/libs/tinybird/pipes/project_insights.pipe | Updated to filter type = 'project' after project_insights_copy_ds becomes mixed-type. |
| services/libs/tinybird/pipes/project_insights_copy.pipe | Extended copy logic to UNION project records with repo records, sourcing repo base from repositories_populated_ds. |
| services/libs/tinybird/pipes/health_score_stars.pipe | Refactor to use shared benchmark include. |
| services/libs/tinybird/pipes/health_score_retention.pipe | Refactor to use shared benchmark include. |
| services/libs/tinybird/pipes/health_score_pull_requests.pipe | Refactor to use shared benchmark include. |
| services/libs/tinybird/pipes/health_score_organization_dependency.pipe | Refactor to use shared benchmark include. |
| services/libs/tinybird/pipes/health_score_merge_lead_time.pipe | Refactor to use shared benchmark include. |
| services/libs/tinybird/pipes/health_score_issues_resolution.pipe | Refactor to use shared benchmark include. |
| services/libs/tinybird/pipes/health_score_forks.pipe | Refactor to use shared benchmark include. |
| services/libs/tinybird/pipes/health_score_contributor_dependency.pipe | Refactor to use shared benchmark include. |
| services/libs/tinybird/pipes/health_score_contributions_outside_work_hours.pipe | Refactor to use shared benchmark include. |
| services/libs/tinybird/pipes/health_score_active_days.pipe | Refactor to use shared benchmark include. |
| services/libs/tinybird/pipes/health_score_active_contributors.pipe | Refactor to use shared benchmark include. |
| services/libs/tinybird/includes/health_score_stars.incl | New shared benchmark/include logic for stars scoring. |
| services/libs/tinybird/includes/health_score_retention.incl | New shared benchmark/include logic for retention scoring. |
| services/libs/tinybird/includes/health_score_pull_requests.incl | New shared benchmark/include logic for PR scoring. |
| services/libs/tinybird/includes/health_score_organization_dependency.incl | New shared processing + benchmark logic for org dependency scoring. |
| services/libs/tinybird/includes/health_score_merge_lead_time.incl | New shared benchmark/include logic for merge lead time scoring. |
| services/libs/tinybird/includes/health_score_issues_resolution.incl | New shared benchmark/include logic for issue resolution scoring. |
| services/libs/tinybird/includes/health_score_forks.incl | New shared benchmark/include logic for forks scoring. |
| services/libs/tinybird/includes/health_score_contributor_dependency.incl | New shared processing + benchmark logic for contributor dependency scoring. |
| services/libs/tinybird/includes/health_score_contributions_outside_work_hours.incl | New shared benchmark/include logic for outside-work-hours scoring. |
| services/libs/tinybird/includes/health_score_active_days.incl | New shared benchmark/include logic for active-days scoring. |
| services/libs/tinybird/includes/health_score_active_contributors.incl | New shared benchmark/include logic for active-contributors scoring. |
| services/libs/tinybird/datasources/repositories_populated_ds.datasource | New datasource for enriched repository metadata. |
| services/libs/tinybird/datasources/repo_health_score_copy_ds.datasource | New datasource for repository health score rollups and raw metrics/benchmarks. |
| services/libs/tinybird/datasources/project_insights_copy_ds.datasource | Updated schema to include type + repoUrl and adjusted sorting key for mixed-type records. |

Comment on lines +23 to +27
SELECT
channel,
uniq(CASE WHEN memberId != '' THEN memberId ELSE NULL END) AS contributorCount,
uniq(CASE WHEN organizationId != '' THEN organizationId ELSE NULL END) AS organizationCount
FROM activityRelations_deduplicated_cleaned_bucket_union
Copilot AI Mar 31, 2026

uniq(CASE ... ELSE NULL END) will still count NULL as a distinct value in ClickHouse aggregate functions, so repositories that have any rows with empty memberId/organizationId can be overcounted by 1. Use uniqIf(memberId, memberId != '') / uniqIf(organizationId, organizationId != '') (or countDistinctIf) to exclude empty IDs without introducing NULL into the aggregation.
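
Independent of how the engine treats NULL inside `uniq`, the intended result is a per-repository count of distinct non-empty IDs. The Python sketch below models that target behaviour so the expected numbers are unambiguous; the row shape and the function name are assumptions.

```python
from collections import defaultdict


def contributor_counts(rows):
    """Count distinct non-empty member IDs per channel.

    `rows` is an iterable of (channel, member_id) pairs. Channels that only
    have empty or missing IDs are reported with a count of 0.
    """
    members = defaultdict(set)
    for channel, member_id in rows:
        members[channel]  # touch the key so the channel appears even with no members
        if member_id:  # skips both '' and None
            members[channel].add(member_id)
    return {channel: len(ids) for channel, ids in members.items()}


if __name__ == "__main__":
    rows = [("r1", "a"), ("r1", ""), ("r1", "a"), ("r1", "b"), ("r2", ""), ("r2", None)]
    print(contributor_counts(rows))
```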

Copilot uses AI. Check for mistakes.
Comment on lines +22 to +29
{% if defined(startDate) %}
AND timestamp
> {{ DateTime(startDate, description="Filter after date", required=False) }}
{% end %}
{% if defined(endDate) %}
AND timestamp
< {{ DateTime(endDate, description="Filter before date", required=False) }}
{% end %}
Copilot AI Mar 31, 2026

In single-repo (repoUrl) mode there’s no default previous-quarter time window unless startDate/endDate are provided, but batch mode restricts to the previous quarter. This makes the API endpoint return all-time active contributors by default. Consider applying the same previous-quarter bounds when repoUrl is defined and no dates are passed.

Suggested change
- {% if defined(startDate) %}
- AND timestamp
- > {{ DateTime(startDate, description="Filter after date", required=False) }}
- {% end %}
- {% if defined(endDate) %}
- AND timestamp
- < {{ DateTime(endDate, description="Filter before date", required=False) }}
- {% end %}
+ {% if defined(startDate) or defined(endDate) %}
+ {% if defined(startDate) %}
+ AND timestamp
+ > {{ DateTime(startDate, description="Filter after date", required=False) }}
+ {% end %}
+ {% if defined(endDate) %}
+ AND timestamp
+ < {{ DateTime(endDate, description="Filter before date", required=False) }}
+ {% end %}
+ {% else %}
+ AND timestamp >= toStartOfQuarter(now() - toIntervalQuarter(1))
+ AND timestamp < toStartOfQuarter(now())
+ {% end %}
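
The fallback in this suggestion can be read as: explicit dates win, otherwise use the previous calendar quarter. A minimal Python sketch of that rule follows, assuming naive datetimes and hypothetical function names; it models the bounds only, not the template.

```python
from datetime import datetime


def previous_quarter_bounds(now):
    """Return [start, end) of the calendar quarter before the one containing `now`."""
    first_month = 3 * ((now.month - 1) // 3) + 1  # 1, 4, 7 or 10
    end = datetime(now.year, first_month, 1)  # start of the current quarter
    if first_month == 1:
        start = datetime(now.year - 1, 10, 1)
    else:
        start = datetime(now.year, first_month - 3, 1)
    return start, end


def resolve_quarter_window(start, end, now):
    """Use caller-supplied bounds if any were given, else the previous quarter."""
    if start is not None or end is not None:
        return start, end  # a missing side stays open, as in the template
    return previous_quarter_bounds(now)


if __name__ == "__main__":
    print(resolve_quarter_window(None, None, datetime(2026, 3, 31, 21, 24)))
```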

Comment on lines +21 to +26
AND timestamp
> {{ DateTime(startDate, description="Filter after date", required=False) }}
{% end %}
{% if defined(endDate) %}
AND timestamp
< {{ DateTime(endDate, description="Filter before date", required=False) }}
Copilot AI Mar 31, 2026

In single-repo (repoUrl) mode there’s no default 365-day window unless startDate/endDate are provided, but batch mode is explicitly “last 365 days”. This makes the API endpoint return all-time results by default. Apply the same 365-day bounds in the repoUrl branch when no explicit dates are passed.

Suggested change
- AND timestamp
- > {{ DateTime(startDate, description="Filter after date", required=False) }}
- {% end %}
- {% if defined(endDate) %}
- AND timestamp
- < {{ DateTime(endDate, description="Filter before date", required=False) }}
+ AND timestamp > {{ DateTime(startDate, description="Filter after date", required=False) }}
+ {% else %}
+ AND timestamp >= toStartOfDay(now() - toIntervalDay(365))
+ {% end %}
+ {% if defined(endDate) %}
+ AND timestamp < {{ DateTime(endDate, description="Filter before date", required=False) }}
+ {% else %}
+ AND timestamp < toStartOfDay(now())
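
Unlike the previous-quarter case, this suggestion defaults each bound independently. A small Python sketch of that behaviour, with a hypothetical function name and naive datetimes assumed:

```python
from datetime import datetime, timedelta


def resolve_365_day_window(start, end, now):
    """Fill in whichever bound is missing with the 365-day default."""
    today = now.replace(hour=0, minute=0, second=0, microsecond=0)
    if start is None:
        start = today - timedelta(days=365)  # start of the day 365 days ago
    if end is None:
        end = today  # start of the current day
    return start, end


if __name__ == "__main__":
    print(resolve_365_day_window(None, None, datetime(2026, 3, 31, 21, 24)))
```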

Comment on lines +17 to +21
WHERE
(type = 'pull_request-opened' OR type = 'merge_request-opened' OR type = 'changeset-created')
AND channel = {{ String(repoUrl, description="Repository URL", required=False) }}
AND channel NOT IN (SELECT channel FROM repos_to_channels_excluded)
{% if defined(startDate) %}
Copilot AI Mar 31, 2026

In single-repo (repoUrl) mode there’s no default 365-day window unless startDate/endDate are provided, but batch mode is “last 365 days”. This makes the API endpoint return all-time PR counts by default. Add a default 365-day filter in the repoUrl branch when no explicit dates are passed.

Comment on lines +17 to +21
WHERE
memberId != ''
AND (type, platform) IN (SELECT activityType, platform FROM activityTypes_filtered)
AND channel = {{ String(repoUrl, description="Repository URL", required=False) }}
AND channel NOT IN (SELECT channel FROM repos_to_channels_excluded)
Copilot AI Mar 31, 2026

In single-repo (repoUrl) mode there’s no default 365-day window unless startDate/endDate are provided, but batch mode restricts to the last 365 days. This makes the API endpoint return all-time dependency metrics by default. Apply the same default window in the repoUrl branch when no explicit dates are passed.

Comment on lines +22 to +26
WHERE
category NOT IN ('Documentation', 'Vulnerability Management')
AND repo != ''
{% if defined(repoUrl) %}
AND repo = {{ String(repoUrl, description="Repository URL", required=False) }}
Copilot AI Mar 31, 2026

This pipe doesn’t exclude repos/channels in repos_to_channels_excluded, unlike the other repo health score metrics. As a result, calling the endpoint with an excluded repoUrl can still return a security score. Consider adding an exclusion filter to match the other health-score pipes’ behavior.

Comment on lines +43 to +45
COALESCE(owh.contributionsOutsideWorkHours, 0) AS contributionsOutsideWorkHours,
COALESCE(owh.contributionsOutsideWorkHoursBenchmark, 0) AS contributionsOutsideWorkHoursBenchmark,
COALESCE(sec.securityPercentage, 0) AS securityPercentage
Copilot AI Mar 31, 2026

securityPercentage is COALESCE’d to 0 when there is no matching row from repo_health_score_security, which penalizes repos that simply lack security evaluation data (and differs from how other missing metrics are excluded via arrayFilter(... >= 0)). Consider keeping it NULL when absent and handling it explicitly in overallScore.

CASE WHEN organizationId != '' THEN organizationId ELSE NULL END
) AS activeOrganizationsLast365Days
FROM activityRelations_deduplicated_cleaned_bucket_union
WHERE timestamp <= now()
Copilot AI Mar 31, 2026

The “last 365 days” repo metrics node only filters timestamp <= now() (no lower bound), so it actually counts all historical activity. Add a timestamp >= now() - INTERVAL 365 DAY bound (or equivalent) to match the column name/description.

Suggested change
- WHERE timestamp <= now()
+ WHERE timestamp >= now() - INTERVAL 365 DAY
+ AND timestamp <= now()

CASE WHEN organizationId != '' THEN organizationId ELSE NULL END
) AS activeOrganizationsPrevious365Days
FROM activityRelations_deduplicated_cleaned_bucket_union
WHERE timestamp < now() - INTERVAL 365 DAY
Copilot AI Mar 31, 2026

The “previous 365 days” repo metrics node only filters timestamp < now() - INTERVAL 365 DAY (no lower bound), so it counts all activity older than 365 days rather than the 365–730 day window implied by the column name. Add a lower bound (e.g., timestamp >= now() - INTERVAL 730 DAY).

Suggested change
- WHERE timestamp < now() - INTERVAL 365 DAY
+ WHERE timestamp >= now() - INTERVAL 730 DAY
+ AND timestamp < now() - INTERVAL 365 DAY
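
Taken together, the two bounded windows partition recent activity into "last 365 days" and "previous 365 days" with no overlap. The Python sketch below classifies a timestamp the same way as the suggested WHERE clauses; the function name and the returned labels are hypothetical.

```python
from datetime import datetime, timedelta


def period_of(timestamp, now):
    """Return which 365-day window a timestamp falls into, or None if outside both."""
    age = now - timestamp
    if age < timedelta(0):
        return None  # in the future: timestamp <= now() fails
    if age <= timedelta(days=365):
        return "last365"  # now - 365d <= timestamp <= now
    if age <= timedelta(days=730):
        return "previous365"  # now - 730d <= timestamp < now - 365d
    return None  # older than 730 days


if __name__ == "__main__":
    now = datetime(2026, 3, 31)
    for days in (10, 365, 366, 730, 731):
        print(days, period_of(now - timedelta(days=days), now))
```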

Comment on lines +166 to +169
uniq(CASE WHEN memberId != '' THEN memberId ELSE NULL END) AS activeContributorsLast365Days,
uniq(
CASE WHEN organizationId != '' THEN organizationId ELSE NULL END
) AS activeOrganizationsLast365Days
Copilot AI Mar 31, 2026

uniq(CASE ... ELSE NULL END) will count NULL as a distinct value in ClickHouse, so a repo with any rows missing memberId/organizationId can be overcounted by 1. Prefer uniqIf(memberId, memberId != '') / uniqIf(organizationId, organizationId != '') (or countDistinctIf).

Signed-off-by: Gašper Grom <gasper.grom@gmail.com>

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 3 potential issues.

There are 5 total unresolved issues (including 2 from previous reviews).

pm.starsPrevious365Days AS starsPrevious365Days,
pm.forksPrevious365Days AS forksPrevious365Days,
pm.activeContributorsPrevious365Days AS activeContributorsPrevious365Days,
pm.activeOrganizationsPrevious365Days AS activeOrganizationsPrevious365Days

Stale projects can break insights copy

High Severity

project_insights_copy_project_results selects pm.* metrics without COALESCE, but project_insights_copy_period_metrics only emits rows for segments with activity in the last 730 days. Projects with older/no recent activity get NULL period metrics, which conflicts with non-null UInt64 columns in project_insights_copy_ds and can fail the copy.
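
One way to picture the fix is a left join that substitutes 0 for every missing period metric, which is what wrapping the `pm.*` columns in COALESCE would achieve. The Python sketch below models that; the column names come from the snippet above, while the function name and input shapes are assumptions.

```python
PERIOD_COLUMNS = (
    "starsPrevious365Days",
    "forksPrevious365Days",
    "activeContributorsPrevious365Days",
    "activeOrganizationsPrevious365Days",
)


def with_period_metrics(project_ids, period_metrics):
    """Left-join projects to their period metrics, defaulting missing values to 0."""
    rows = []
    for project_id in project_ids:
        metrics = period_metrics.get(project_id, {})  # stale projects have no entry
        row = {"id": project_id}
        for column in PERIOD_COLUMNS:
            value = metrics.get(column)
            row[column] = 0 if value is None else value
        rows.append(row)
    return rows


if __name__ == "__main__":
    print(with_period_metrics(["p1", "p2"], {"p1": {"starsPrevious365Days": 7}}))
```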

rp.softwareValue AS softwareValue,
rp.firstCommit AS firstCommit
FROM repositories_populated_ds AS rp
JOIN repositories r FINAL ON r.id = rp.id

Repo status checks missing in insights join

Medium Severity

project_insights_copy_repo_base joins repositories by id without checking enabled, excluded, or deletedAt. Because project_insights_copy runs daily while repositories_populated_ds refreshes hourly, stale rows in rp can still be emitted as active repo insights for a full day after a repo is disabled, excluded, or deleted.
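
The status check described here amounts to filtering the join on three fields. A minimal Python sketch follows, assuming `enabled`, `excluded`, and `deletedAt` are available on each repository record as the comment implies; the function names and record shapes are hypothetical.

```python
def is_active(repo):
    """A repo counts as active when enabled, not excluded, and not deleted."""
    return (
        bool(repo.get("enabled"))
        and not repo.get("excluded")
        and repo.get("deletedAt") is None
    )


def active_repo_rows(populated, repositories):
    """Inner-join populated rows to repositories by id, keeping active repos only."""
    by_id = {repo["id"]: repo for repo in repositories}
    return [
        row for row in populated
        if row["id"] in by_id and is_active(by_id[row["id"]])
    ]


if __name__ == "__main__":
    repos = [
        {"id": 1, "enabled": True, "excluded": False, "deletedAt": None},
        {"id": 2, "enabled": True, "excluded": False, "deletedAt": "2026-03-30"},
    ]
    print(active_repo_rows([{"id": 1}, {"id": 2}, {"id": 3}], repos))
```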

SELECT
channel,
uniq(CASE WHEN memberId != '' THEN memberId ELSE NULL END) AS contributorCount,
uniq(CASE WHEN organizationId != '' THEN organizationId ELSE NULL END) AS organizationCount

Null counted as contributor and organization

Medium Severity

repositories_populated_copy_contributor_org_counts uses uniq(CASE ... ELSE NULL END), and NULL is treated as a distinct value in uniq. Repositories with only empty memberId or organizationId can get counts of 1 instead of 0, which propagates incorrect contributorCount and organizationCount into project_insights_copy_ds.
