The aggregator service is expected to avoid storing duplicate job postings, but duplicate job URLs are not currently prevented at the database level.
In aggregator-service/db.py, the insert_job() query uses ON CONFLICT DO NOTHING, and the comment says duplicate URLs are silently dropped. However, the jobs table does not define a UNIQUE constraint or unique index on the url column.
Because of this, the same job URL can be inserted multiple times.
Expected behavior:
- Non-null duplicate job URLs should be ignored.
- The dashboard should not show the same job posting multiple times.
- Stats and downstream workflows should not be affected by duplicate jobs.
Possible fix:
- Add a partial unique index on jobs(url) where url IS NOT NULL.
- Update/add integration tests for duplicate URL insertion.
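
A minimal sketch of the proposed fix and a matching duplicate-URL check. It makes several assumptions: the real schema and database engine are not shown in this issue, so it uses Python's built-in sqlite3 with an in-memory database as a stand-in, and the title column, the index name jobs_url_unique, and the count_jobs() helper are hypothetical. The actual insert_job() in aggregator-service/db.py may look different.

```python
import sqlite3

# Hypothetical schema: only the url column is taken from the issue;
# the other columns and the index name are illustrative.
SCHEMA = """
CREATE TABLE IF NOT EXISTS jobs (
    id INTEGER PRIMARY KEY,
    title TEXT NOT NULL,
    url TEXT
);
-- Partial unique index: enforces uniqueness only for non-null URLs.
CREATE UNIQUE INDEX IF NOT EXISTS jobs_url_unique
    ON jobs (url) WHERE url IS NOT NULL;
"""


def init_db(conn):
    """Create the table and the partial unique index."""
    conn.executescript(SCHEMA)


def insert_job(conn, title, url):
    """Insert a job; a row whose non-null url already exists is skipped."""
    conn.execute(
        "INSERT INTO jobs (title, url) VALUES (?, ?) ON CONFLICT DO NOTHING",
        (title, url),
    )
    conn.commit()


def count_jobs(conn):
    """Return the total number of rows in jobs."""
    return conn.execute("SELECT COUNT(*) FROM jobs").fetchone()[0]


conn = sqlite3.connect(":memory:")
init_db(conn)

# Same URL twice: the second insert hits the unique index and is dropped.
insert_job(conn, "Backend Engineer", "https://example.com/jobs/1")
insert_job(conn, "Backend Engineer (repost)", "https://example.com/jobs/1")

# Jobs without a URL are outside the partial index, so both are kept.
insert_job(conn, "No link A", None)
insert_job(conn, "No link B", None)

print(count_jobs(conn))
```

Without the index, ON CONFLICT DO NOTHING has no constraint to conflict with, so all four rows would be stored. An integration test against the project's real database could follow the same pattern: insert the same URL twice and assert that only one row exists.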
Hi @sharmavaibhav31,
I would like to work on this issue as part of GSSoC 2026.
Please assign this issue to me. I’ll start working on it and provide updates accordingly.
Thank you!