3 changes: 3 additions & 0 deletions docs/progress.md
@@ -15,3 +15,6 @@
- 2026-03-19: Deployed production successfully and verified the live domain with a real smoke test: claim -> publish -> HTML read -> raw read -> list -> delete on `https://bul.sh`.
- 2026-03-19: Investigated true custom-domain external rewrites on Vercel. Redirects propagate to `bul.sh`, but rewrite routes did not behave as required on the custom domain.
- 2026-03-19: Adopted the pragmatic Vercel production read path: serve pre-rendered HTML through Hono with aggressive edge-cache headers so subsequent reads are CDN hits while content remains stored in Blob.
- 2026-03-20: Added a concrete cost-control and anti-abuse implementation plan to the project plan so hosted usage can be hardened later without redesigning the product.
- 2026-03-20: Implemented first-pass hosted abuse controls in the service layer: reserved namespaces, markdown size caps, claim and publish rate limits, and lazy reclaim of empty stale namespaces.
- 2026-03-20: Added automated coverage for the abuse controls through integration tests on the HTTP app.
99 changes: 98 additions & 1 deletion docs/project-plan.md
@@ -132,6 +132,100 @@ Local .pub mapping:
- Revisions: `rev:{page_id}:v{n} → { blob_key, published_at }` + `current_version` field
- Graduate to Postgres when KV queries become painful (listing, search, etc.)

## Cost Model & Abuse Control

The biggest risk on Vercel is not steady-state storage. It is abuse:
- too many namespace claims
- too many write operations
- too many large pages
- too many cache misses from spam content

Storage itself should stay cheap for a long time. The app is mostly text, pages are small, and read traffic is edge-cached. The practical cost center to control is **writes and churn**, not simply page count.

### Design Principle

Keep the hosted version easy to use for legitimate humans and AI agents, but make abuse expensive or slow.

### Phase 1 Controls (implement first)

**1. Claim rate limiting**
- Limit namespace claims per IP
- Suggested starting point:
- 3 claims per hour per IP
- 10 claims per day per IP
- Goal: stop namespace-squatting scripts and low-effort spam

**2. Publish rate limiting**
- Limit publishes by both IP and namespace
- Suggested starting point:
- 30 publishes per 10 minutes per namespace
- 100 publishes per hour per IP
- Goal: stop automated flooding while allowing normal iterative editing
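Both limits fit the same fixed-window counter that the blob store's `RateLimitRecord` (`{ count, windowStartedAt }`) can back. A minimal sketch — the function name and wiring here are illustrative, not part of the implemented service layer:

```typescript
interface RateLimitRecord {
  count: number;
  windowStartedAt: string; // ISO timestamp
}

interface RateLimitResult {
  allowed: boolean;
  record: RateLimitRecord;
}

// Fixed-window check: start a fresh window when the old one has elapsed
// (or no record exists yet), otherwise count the request against the cap.
function checkFixedWindow(
  current: RateLimitRecord | null,
  limit: number,
  windowMs: number,
  now: Date = new Date(),
): RateLimitResult {
  const windowStart = current
    ? Date.parse(current.windowStartedAt)
    : Number.NaN;

  if (
    current === null ||
    Number.isNaN(windowStart) ||
    now.getTime() - windowStart >= windowMs
  ) {
    // New window: this request is the first in it.
    return {
      allowed: true,
      record: { count: 1, windowStartedAt: now.toISOString() },
    };
  }

  if (current.count >= limit) {
    // Over the cap: leave the record alone so the window does not slide.
    return { allowed: false, record: current };
  }

  return { allowed: true, record: { ...current, count: current.count + 1 } };
}
```

The caller would load the current record with `getRateLimitRecord(bucket)`, apply the check, and persist the returned record with `setRateLimitRecord` when allowed. Note that this read-modify-write is not atomic on Blob storage, so concurrent requests can slightly exceed the cap.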

**3. Markdown size limits**
- Hard cap on request body / markdown size
- Suggested starting point:
- 256 KB per page for v1
- Goal: prevent Blob from becoming arbitrary cheap object storage
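The cap should be enforced on encoded bytes rather than JavaScript string length, since multi-byte characters are stored at their UTF-8 size. A sketch of the check (the constant mirrors the suggested v1 cap):

```typescript
const MAX_MARKDOWN_BYTES = 256 * 1024; // suggested v1 cap

// Measure the UTF-8 encoded size, not the JS string length, so emoji
// and CJK text count at their real stored size.
function markdownWithinLimit(markdown: string): boolean {
  return new TextEncoder().encode(markdown).length <= MAX_MARKDOWN_BYTES;
}
```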

**4. Reserved namespaces**
- Block obvious or sensitive names
- Initial reserved set:
- `admin`
- `api`
- `www`
- `support`
- `help`
- `install`
- `bul`
- `pubmd`
- `root`
- Goal: avoid confusion, collisions, and support burden
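The check itself is a set lookup; comparing lowercased names keeps `Admin` from slipping past the reservation. A sketch using the initial set above:

```typescript
const RESERVED_NAMESPACES = new Set([
  "admin", "api", "www", "support", "help",
  "install", "bul", "pubmd", "root",
]);

// Reserved-name check; lowercase first so casing tricks do not bypass it.
function isReservedNamespace(namespace: string): boolean {
  return RESERVED_NAMESPACES.has(namespace.toLowerCase());
}
```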

**5. Empty-namespace reclaim policy**
- If a namespace is claimed but no page is published within 7 days, reclaim it
- Goal: reduce squatting without adding a full identity system
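Because reclaim is lazy, the check can run at claim time instead of on a scheduler: a namespace is reclaimable when it has no pages and its claim is older than the grace window. A sketch, assuming the namespace record's `createdAt` timestamp is available:

```typescript
const RECLAIM_AFTER_MS = 7 * 24 * 60 * 60 * 1000; // 7-day grace window

// A claimed namespace is reclaimable when it holds no pages and the
// claim predates the grace window.
function isReclaimable(
  createdAt: string,
  pageCount: number,
  now: Date = new Date(),
): boolean {
  return (
    pageCount === 0 &&
    now.getTime() - Date.parse(createdAt) >= RECLAIM_AFTER_MS
  );
}
```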

### Phase 2 Controls (only if needed)

**6. Token rotation**
- Add `pubmd token rotate`
- Invalidate old namespace token on rotation
- Useful if a token leaks or a namespace is shared accidentally

**7. Lightweight audit visibility**
- Track:
- last claim time
- last publish time
- publish count over recent windows
- Goal: make abuse visible before building a moderation dashboard

**8. Optional friction for suspicious traffic**
- Only if needed later:
- proof-of-work
- challenge pages
- manual review queue
- Not a v1 priority

### Implementation Notes

- Enforcement should happen in the service layer, not just at the CDN edge
- Limits should be configurable via environment variables
- The hosted instance and self-hosted instances should be able to use different defaults
- Abuse controls should fail with clear machine-readable errors so AI agents can recover gracefully
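For the last point, a small error envelope with a stable code field is enough for agents to branch on. One possible shape — the field names are illustrative, not an implemented contract:

```typescript
interface AbuseControlError {
  error: string; // stable machine-readable code, e.g. "rate_limited"
  message: string; // human-readable explanation
  retryAfterSeconds?: number; // present on rate-limit errors
}

// Build the body for a 429 response; agents branch on `error`,
// humans read `message`.
function rateLimitedBody(retryAfterSeconds: number): AbuseControlError {
  return {
    error: "rate_limited",
    message: `Too many requests; retry in ${retryAfterSeconds}s.`,
    retryAfterSeconds,
  };
}
```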

### Metrics To Watch

- namespaces claimed / day
- namespaces reclaimed without publish
- publishes / namespace / day
- median markdown size
- 95th percentile markdown size
- cache hit ratio on page reads
- total Blob writes vs. reads

If those numbers stay low, keep the system simple. If they climb unnaturally, harden the hosted instance before scaling usage.

## Milestones

### M0: Spike (1 day)
@@ -163,6 +257,9 @@ Local .pub mapping:
- [ ] Math/KaTeX + Mermaid rendering (add when requested)
- [ ] Page versioning (keep history, show diffs) — data model already supports this
- [ ] Page renames with redirects — data model already supports this
- [x] Lightweight anti-abuse controls (claim/publish rate limits, reserved namespaces, max page size)
- [x] Namespace reclaim policy for empty claims
- [ ] Token rotation
- [ ] View count analytics
- [ ] Page collections with auto-generated index
- [ ] Expiring pages (TTL)
@@ -190,7 +287,7 @@ Local .pub mapping:
## Things to Decide

- [ ] **Name**: `pub`? `md.pub`? `mdpost`? `pushmd`? Need a good domain.
- [ ] **Free tier limits**: unlimited pages? Rate limit only? Storage cap?
- [ ] **Hosted free tier**: what claim/publish/size limits are acceptable before introducing stronger friction?
- [ ] **Subdomain vs path**: `namespace.domain` vs `domain/namespace` — start with path, add subdomain later?
- [ ] **Markdown flavor**: strict GFM or also support Obsidian-flavored ([[wikilinks]], ==highlights==, callouts)?
- [ ] **Default visibility**: unlisted (noindex) or public?
135 changes: 72 additions & 63 deletions src/core/blob-store.ts
@@ -1,4 +1,4 @@
import { del, get, put } from "@vercel/blob";
import { del, get, list, put } from "@vercel/blob";
import { z } from "zod";

import {
@@ -11,14 +11,16 @@ import {
type FilePayload,
NamespaceNotFoundError,
type PublishRepository,
type RateLimitRecord,
} from "./repository.js";

const LookupRecordSchema = z.object({
pageId: z.string().uuid(),
});

const NamespacePageIndexSchema = z.object({
pages: z.array(StoredPageSchema),
const RateLimitRecordSchema = z.object({
count: z.number(),
windowStartedAt: z.string(),
});

export function createBlobStore(
@@ -29,13 +31,25 @@
namespace: string,
tokenHash: string,
): Promise<void> {
const record: NamespaceRecord = {
namespace,
tokenHash,
createdAt: new Date().toISOString(),
};
await saveNamespace(
{
namespace,
tokenHash,
createdAt: new Date().toISOString(),
},
false,
);
Comment on lines +37 to +41

Copilot AI Mar 22, 2026


When allowOverwrite is false, put() will fail if the namespace blob already exists (including race conditions between getNamespace and claimNamespace). That failure currently propagates as a storage error and will be mapped to a generic HTTP 400/500. Consider translating overwrite-conflict errors into NamespaceExistsError to keep claim behavior stable under concurrency.
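One way to address this, sketched under the assumption that the overwrite conflict can be recognized from the thrown error (the detection predicate is illustrative; the exact error shape `@vercel/blob` throws should be confirmed):

```typescript
class NamespaceExistsError extends Error {
  constructor(namespace: string) {
    super(`Namespace already exists: ${namespace}`);
    this.name = "NamespaceExistsError";
  }
}

// Wrap the non-overwriting write and translate an "already exists"
// conflict into the domain error the claim flow expects.
async function claimWithConflictTranslation(
  namespace: string,
  write: () => Promise<void>,
): Promise<void> {
  try {
    await write();
  } catch (error) {
    // Illustrative predicate: match on the message, since the concrete
    // error class exposed by the blob SDK may vary between versions.
    if (error instanceof Error && /exist/i.test(error.message)) {
      throw new NamespaceExistsError(namespace);
    }
    throw error;
  }
}
```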

}

await writeJsonBlob(namespacePath(namespace), record, false);
async function saveNamespace(
record: NamespaceRecord,
allowOverwrite = true,
): Promise<void> {
await writeJsonBlob(
namespacePath(record.namespace),
record,
allowOverwrite,
);
}

async function getNamespace(
@@ -54,22 +68,50 @@ export function createBlobStore(
throw new NamespaceNotFoundError(namespace);
}

await writeJsonBlob(namespacePath(namespace), {
await saveNamespace({
...current,
lastPublishAt,
});
}

async function getRateLimitRecord(
bucket: string,
): Promise<RateLimitRecord | null> {
return readJsonBlob(rateLimitPath(bucket), RateLimitRecordSchema);
}

async function setRateLimitRecord(
bucket: string,
record: RateLimitRecord,
): Promise<void> {
await writeJsonBlob(rateLimitPath(bucket), record);
}

async function listPages(namespace: string): Promise<StoredPage[]> {
const index = await readJsonBlob(
namespaceIndexPath(namespace),
NamespacePageIndexSchema,
);
const pages = index?.pages ?? [];
const lookupResults = await list({
limit: 1000,
prefix: `${lookupPrefix(namespace)}/`,
token: metadataToken,
});
Comment on lines 90 to +95

Copilot AI Mar 22, 2026


listPages hard-codes limit: 1000 and ignores pagination (hasMore/cursor). Namespaces with >1000 pages will be silently truncated. Consider iterating through all pages using the list API’s pagination fields so listing remains correct for larger namespaces.
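The fix described here is a cursor loop over the list API's pagination fields (`cursor`/`hasMore`). A sketch with the page fetcher injected, so the loop is independent of the SDK:

```typescript
interface ListPage<T> {
  blobs: T[];
  cursor?: string;
  hasMore: boolean;
}

// Drain every page of a cursor-paginated list API, such as
// @vercel/blob's list({ prefix, cursor, limit }).
async function listAllBlobs<T>(
  fetchPage: (cursor?: string) => Promise<ListPage<T>>,
): Promise<T[]> {
  const all: T[] = [];
  let cursor: string | undefined;

  do {
    const page = await fetchPage(cursor);
    all.push(...page.blobs);
    cursor = page.cursor;
    if (!page.hasMore) break;
  } while (cursor !== undefined);

  return all;
}
```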


return pages.sort((left, right) =>
right.updatedAt.localeCompare(left.updatedAt),
const pages = await Promise.all(
lookupResults.blobs.map(async (lookupBlob) => {
const lookup = await readJsonBlob(
lookupBlob.pathname,
LookupRecordSchema,
);

if (lookup === null) {
return null;
}

return findPageById(lookup.pageId);
}),
);

return pages
.filter((page): page is StoredPage => page !== null)
.sort((left, right) => right.updatedAt.localeCompare(left.updatedAt));
Comment on lines +112 to +114

Copilot AI Mar 22, 2026


listPages can return duplicates if multiple lookup blobs point at the same pageId (e.g., if cleanup of an old slug lookup fails after a rename). Since the method maps lookups -> findPageById without deduping, the same page can appear multiple times. Consider deduping by pageId (or slug) after resolving pages.

Suggested change
return pages
.filter((page): page is StoredPage => page !== null)
.sort((left, right) => right.updatedAt.localeCompare(left.updatedAt));
const nonNullPages = pages.filter(
(page): page is StoredPage => page !== null,
);
const uniquePagesById = new Map<string, StoredPage>();
for (const page of nonNullPages) {
// Deduplicate by page identifier to avoid returning the same page multiple times
if (!uniquePagesById.has(page.id)) {
uniquePagesById.set(page.id, page);
}
}
return Array.from(uniquePagesById.values()).sort((left, right) =>
right.updatedAt.localeCompare(left.updatedAt),
);

}

async function findPageById(pageId: string): Promise<StoredPage | null> {
@@ -106,7 +148,6 @@ export function createBlobStore(
writeJsonBlob(lookupPath(page.namespace, page.slug), {
pageId: page.pageId,
}),
writeNamespaceIndex(page.namespace, page),
]);

if (previousPage !== null && previousPage.slug !== page.slug) {
@@ -122,7 +163,6 @@
token: metadataToken,
}),
del([page.markdownBlobKey, page.htmlBlobKey], { token: contentToken }),
removeFromNamespaceIndex(page.namespace, page.pageId),
]);
}

@@ -192,58 +232,20 @@ export function createBlobStore(
return `namespaces/${namespace}.json`;
}

function namespaceIndexPath(namespace: string): string {
return `indexes/${namespace}.json`;
}

function pagePath(pageId: string): string {
return `pages/${pageId}.json`;
}

function lookupPath(namespace: string, slug: string): string {
return `lookups/${namespace}/${slug}.json`;
function lookupPrefix(namespace: string): string {
return `lookups/${namespace}`;
}

async function writeNamespaceIndex(
namespace: string,
page: StoredPage,
): Promise<void> {
const current = await readJsonBlob(
namespaceIndexPath(namespace),
NamespacePageIndexSchema,
);
const nextPages = [...(current?.pages ?? [])];
const existingIndex = nextPages.findIndex(
(currentPage) => currentPage.pageId === page.pageId,
);

if (existingIndex === -1) {
nextPages.push(page);
} else {
nextPages[existingIndex] = page;
}

await writeJsonBlob(namespaceIndexPath(namespace), {
pages: nextPages,
});
function lookupPath(namespace: string, slug: string): string {
return `lookups/${namespace}/${slug}.json`;
}

async function removeFromNamespaceIndex(
namespace: string,
pageId: string,
): Promise<void> {
const current = await readJsonBlob(
namespaceIndexPath(namespace),
NamespacePageIndexSchema,
);

if (current === null) {
return;
}

await writeJsonBlob(namespaceIndexPath(namespace), {
pages: current.pages.filter((page) => page.pageId !== pageId),
});
function rateLimitPath(bucket: string): string {
return `rate-limits/${sanitizeBucket(bucket)}.json`;
}

return {
@@ -252,10 +254,13 @@
findPageById,
findPageBySlug,
getNamespace,
getRateLimitRecord,
listPages,
readHtml,
readMarkdown,
saveNamespace,
savePage,
setRateLimitRecord,
touchNamespace,
};
}
@@ -269,3 +274,7 @@
function stringifyJson(value: unknown): string {
return `${JSON.stringify(value, null, 2)}\n`;
}

function sanitizeBucket(bucket: string): string {
return bucket.replaceAll(/[^a-zA-Z0-9/_-]+/g, "_");
}