From e881f3a3c064484e17d4a1ee40c3de047035c7de Mon Sep 17 00:00:00 2001
From: Benjamin Case
Date: Mon, 8 May 2023 11:26:10 -0400
Subject: [PATCH 1/4] Event Labeling Queries, first draft

---
 ipa-end-to-end-images/image2.png | Bin 0 -> 9520 bytes
 1 file changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 ipa-end-to-end-images/image2.png

diff --git a/ipa-end-to-end-images/image2.png b/ipa-end-to-end-images/image2.png
new file mode 100644
index 0000000000000000000000000000000000000000..3be1360435a3b911230d9be4ef8dcf1b4b773b7d
GIT binary patch
literal 9520
[base85-encoded PNG data omitted]

Date: Mon, 8 May 2023 11:27:01 -0400
Subject: [PATCH 2/4] Event Labeling Queries, first draft

---
 IPA-End-to-End.md | 228 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 228 insertions(+)

diff --git a/IPA-End-to-End.md b/IPA-End-to-End.md
index 9ce62f0..b81fcef 100644
--- a/IPA-End-to-End.md
+++ b/IPA-End-to-End.md
@@ -1907,6 +1907,234 @@ After computing the aggregates within each category, the _helper parties_ P

 The noise is generated within an MPC protocol. This could be done by having each party sample noise independently then add it to the secret shared aggregated value. This is a simple approach but has the drawback that the privacy guarantees in case of a corrupted MPC party are lower since the corrupted party will know their share of the noise and deduct it from the aggregated value. A better but more complicated approach is to run a secure coin tossing protocol between parties P1, P2, P3 where the coins are private and then use these coins to run a noise sampling algorithm within MPC to generate secret shares of the DP noise. This noise is then added to the secret shared aggregated value. Using the second approach, a malicious party cannot manipulate the protocol to see a noisy aggregated value with less noise. Hence, the privacy guarantees match the amount of DP noise generated and added as specified in the protocol description even when one party is malicious.

+
+
+# Event Labeling Queries with per matchkey DP bound
+
+There has been considerable interest in supporting event level outputs in IPA ([IPA issue 60](https://github.com/patcg-individual-drafts/ipa/issues/60), [PATCG issue 41](https://github.com/patcg/docs-and-reports/issues/41)). This was discussed at the May 2023 PATCG meeting, where the [consensus](https://github.com/patcg/docs-and-reports/pull/43) was that we could support this so long as we can enforce a per user bound on the information released.
+
+Here we outline how an Event Labeling Query that labels events with noisy labels can be done in a way that lets us maintain a per matchkey DP bound. We also consider how these new queries can be compatible with an IPA system that flexibly supports either aggregation queries or event labeling queries.
+
+
+## Source site differences
+
+There are two settings for a source site
+
+
+
+1. A source site that knows who it is showing ads to and can tie together source reports belonging to the same person. This would be the case of a publisher website with logged-in users.
+2. A source site, such as an ad-network, that shows ads across many different websites and doesn’t necessarily know when it is showing an ad to the same person.
+
+We design an Event Labeling Query that can support both settings, though in the first setting the noise level per event may be more predictable and consistent since the source site can put in the same number of source reports per matchkey. To help support the second setting, we can output some Reach & Frequency statistics to let the Report Collector learn about how many actual users were present in their set of source reports. This will inform them about the expected noise level that was added to the event labels.
+
+
+## Input/Output Structure
+
+In an Event Labeling Query, a source site submits a source fan-out query where the source reports contain the encrypted matchkey and timestamp as well as a source_id, which is a unique index to identify this report back to the source site’s user. The source site gets as output a row for every source report submitted with each output consisting of the source_id and a label. The label is either 0 or 1.
+
+
+![Inputs and Outputs of Event Labeling Queries](ipa-end-to-end-images/image2.png "image_tooltip")
+
+
+
+
+**Query Inputs:**
+
+
+
+* Source reports (is_trigger, matchkey, timestamp, source_id)
+* Trigger reports (is_trigger, matchkey, timestamp)
+* DP budget for this query: query_epsilon
+
+**Query Output:**
+
+
+
+* A row for every source report: (source_id, label) where label in {0,1}
+
+In each query we will count the number of source reports per matchkey and use this to scale the noise for labeling that user’s events, so as to maintain a constant per user information release.
+
+
+## Query Stages for Event Labeling Queries
+
+The stages of the query are as follows where the first two are the same as in regular IPA aggregation queries:
+
+* Report Collector can presort by timestamp
+* Sort by matchkey in the MPC
+* Attribution
+    * Source rows will be labeled as attributed, 1, or not attributed, 0, by some attribution logic that attributes trigger events to earlier source events. It seems we should be able to support any attribution logic here.
+* Label Flipping
+    * First we count the number of source reports that share the same matchkey, call this matchkey_source_count.
+    * For every source row flip its attribution label with probability p, where p is derived from (query_epsilon / matchkey_source_count )
+* Output (source_id, label) to the Report Collector
+
+
+## Example
+
+Consider the following example where we perform last touch attribution.
+
+| Timestamp | Matchkey | source_id | is_trigger | Label | Noisy label |
+| --- | --- | --- | --- | --- | --- |
+| 110 | AAA | 58934 | 0 | 1 | 1 (to be flipped with prob derived from query_epsilon / 1) |
+| 230 | AAA | NA | 1 | | |
+| 002 | CCC | 67654 | 0 | 0 | 0 (to be flipped with prob derived from query_epsilon / 3) |
+| 030 | CCC | 76643 | 0 | 0 | 0 (to be flipped with prob derived from query_epsilon / 3) |
+| 485 | CCC | 66322 | 0 | 1 | 1 (to be flipped with prob derived from query_epsilon / 3) |
+| 672 | CCC | NA | 1 | | |
+
+
+**Output to Report Collector:**
+
+| source_id | Noisy label |
+| --- | --- |
+| 58934 | 1 |
+| 67654 | 1 (flipped) |
+| 76643 | 0 |
+| 66322 | 1 |
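
The label flipping in the example above can be sketched in plaintext Python. This is only an illustration under stated assumptions, not the MPC protocol: it assumes binary randomized response, where with probability p a label is replaced by a uniform draw from {0,1}, so a label ends up flipped with probability p/2 and the ratio (1 - p/2) / (p/2) equals e^epsilon when p = 2 / (1 + e^epsilon). The names `flip_probability` and `noisy_labels` are hypothetical, and attribution is assumed to have already produced the `Label` column.

```python
import math
import random
from collections import Counter


def flip_probability(epsilon: float) -> float:
    """Probability of replacing a label with a uniform draw from {0, 1}.

    Assumes binary randomized response: p = 2 / (1 + e^epsilon).
    """
    return 2.0 / (1.0 + math.exp(epsilon))


def noisy_labels(rows, query_epsilon, rng):
    """Apply per-matchkey label noise to attributed source rows.

    rows: list of (matchkey, source_id, label) for source reports only.
    Each row's epsilon is query_epsilon / matchkey_source_count.
    """
    counts = Counter(matchkey for matchkey, _, _ in rows)
    out = []
    for matchkey, source_id, label in rows:
        p = flip_probability(query_epsilon / counts[matchkey])
        if rng.random() < p:
            label = rng.randrange(2)  # uniform draw from {0, 1}
        out.append((source_id, label))
    return out


# Source rows from the example table, after last touch attribution.
rows = [
    ("AAA", 58934, 1),
    ("CCC", 67654, 0),
    ("CCC", 76643, 0),
    ("CCC", 66322, 1),
]
result = noisy_labels(rows, query_epsilon=1.0, rng=random.Random(0))
```

Rows for matchkey CCC use a third of the query budget each, so their labels are noisier than the single AAA row, which matches the "query_epsilon / 3" versus "query_epsilon / 1" annotations in the table.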
+
+
+
+## Per user DP bound
+
+In order to bound the amount of information released about a user (matchkey) per epoch we needed to do two things:
+
+
+
+1. Reduce the per epoch budget for each query
+2. Within each query ensure that all the labeled events released for a user don’t exceed that particular query’s budget. We can do this by letting the probability for each event’s flip be derived from an epsilon of (query epsilon / number of events labeled per user).
+
+A nice property is that we do not need to rate limit report creation on-device or limit replays of reports.
+
+
+## Simultaneous budgeting for both aggregation queries and event labeling queries
+
+As Charlie pointed out in his [presentation](https://docs.google.com/presentation/d/1Cc8_S46m4-z8o_dM4egYSSEyHxc-bGinSw_Say_93uA/edit#slide=id.g21ee318a45e_0_52), it would be nice to have the flexibility to choose between composition using labeled events or using central DP as the utility may vary depending on the number of queries and use case within the same epsilon budget.
+
+In particular, we would like the IPA system to support both aggregation queries and event labeling queries. It seems that the above construction enables this if both types of queries deduct from the same per epoch budget.
+
+Also the same encrypted match key returned by `get_encrypted_matchkey()` can be used for these Event Labeling Queries or for regular Aggregation Queries. The report collector will just add different associated data to form the reports to be sent into the MPC.
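
The idea of both query types deducting from the same per epoch budget can be sketched as follows. This is a minimal illustration, not how the helper parties track budget: the class `EpochBudget`, its fields, and the numeric values are all hypothetical.

```python
class EpochBudget:
    """One per-epoch epsilon budget shared by all query types (hypothetical helper)."""

    def __init__(self, epoch_epsilon: float):
        self.remaining = epoch_epsilon

    def deduct(self, query_epsilon: float) -> None:
        """Charge a query against the budget, refusing it if the budget is too small."""
        if query_epsilon <= 0:
            raise ValueError("query_epsilon must be positive")
        if query_epsilon > self.remaining:
            raise ValueError("insufficient per-epoch budget")
        self.remaining -= query_epsilon


budget = EpochBudget(epoch_epsilon=10.0)
budget.deduct(3.0)  # an aggregation query
budget.deduct(2.0)  # an event labeling query draws on the same budget
```

The budget does not need to know which kind of query is charging it; only the query_epsilon matters.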
+
+
 # Technical Discussion and Remarks

From a91d2c26b6db8868f993c6f994df67845fb2c072 Mon Sep 17 00:00:00 2001
From: Benjamin Case
Date: Mon, 8 May 2023 11:37:39 -0400
Subject: [PATCH 3/4] update Table of Contents

---
 IPA-End-to-End.md | 19 ++++++++++++-------
 1 file changed, 12 insertions(+), 7 deletions(-)

diff --git a/IPA-End-to-End.md b/IPA-End-to-End.md
index b81fcef..e3ab8e4 100644
--- a/IPA-End-to-End.md
+++ b/IPA-End-to-End.md
@@ -46,6 +46,13 @@ This document provides an end-to-end overview of that protocol, focusing primari
     * [Oblivious Last Touch Attribution](#oblivious-last-touch-attribution)
     * [User Level Sensitivity Capping](#user-level-sensitivity-capping)
     * [Computing the Aggregates](#computing-the-aggregates)
+* [Event Labeling Queries with per matchkey DP bound](#event-labeling-queries-with-per-matchkey-dp-bound)
+    * [Source site differences](#source-site-differences)
+    * [Input/Output Structure](#inputoutput-structure)
+    * [Query Stages for Event Labeling Queries](#query-stages-for-event-labeling-queries)
+    * [Example](#example)
+    * [Per matchkey DP bound](#per-matchkey-dp-bound)
+    * [Budgeting for both aggregation and event labeling queries](#budgeting-for-both-aggregation-and-event-labeling-queries)
 * [Technical Discussion and Remarks](#technical-discussion-and-remarks)
     * [Optimizations](#optimizations)
         * [Two Party Secret Sharing](#two-party-secret-sharing)
@@ -2114,19 +2121,17 @@ Consider the following example where we perform last touch attribution.



-## Per user DP bound
+## Per matchkey DP bound

-In order to bound the amount of information released about a user (matchkey) per epoch we needed to do two things:
+In order to bound the amount of information released about a matchkey (which is our best approximation of a user in the system) per epoch we needed to do two things:

-
-
-1. Reduce the per epoch budget for each query
-2. Within each query ensure that all the labeled events released for a user don’t exceed that particular query’s budget. We can do this by letting the probability for each event’s flip be derived from an epsilon of (query epsilon / number of events labeled per user).
+1. Deduct from the per epoch budget for each query
+2. Within each query ensure that all the labeled events released for a matchkey don’t exceed that particular query’s budget. We can do this by letting the probability for each event’s flip be derived from an epsilon of (query epsilon / number of events labeled per matchkey).

 A nice property is that we do not need to rate limit report creation on-device or limit replays of reports.


-## Simultaneous budgeting for both aggregation queries and event labeling queries
+## Budgeting for both aggregation and event labeling queries

 As Charlie pointed out in his [presentation](https://docs.google.com/presentation/d/1Cc8_S46m4-z8o_dM4egYSSEyHxc-bGinSw_Say_93uA/edit#slide=id.g21ee318a45e_0_52), it would be nice to have the flexibility to choose between composition using labeled events or using central DP as the utility may vary depending on the number of queries and use case within the same epsilon budget.

From 612cf5cc583e59cb4fbada192ec1bc18c8fda116 Mon Sep 17 00:00:00 2001
From: Benjamin Case
Date: Wed, 17 May 2023 13:51:25 -0400
Subject: [PATCH 4/4] capping version for Event Labeling Queries

---
 IPA-End-to-End.md | 29 ++++++++++++++++++++---------
 1 file changed, 20 insertions(+), 9 deletions(-)

diff --git a/IPA-End-to-End.md b/IPA-End-to-End.md
index e3ab8e4..0a69ba9 100644
--- a/IPA-End-to-End.md
+++ b/IPA-End-to-End.md
@@ -1927,12 +1927,12 @@ Here we outline how an Event Labeling Query that labels events with noisy labels

 There are two settings for a source site

-
-
 1. A source site that knows who it is showing ads to and can tie together source reports belonging to the same person. This would be the case of a publisher website with logged-in users.
 2. A source site, such as an ad-network, that shows ads across many different websites and doesn’t necessarily know when it is showing an ad to the same person.

-We design an Event Labeling Query that can support both settings, though in the first setting the noise level per event may be more predictable and consistent since the source site can put in the same number of source reports per matchkey. To help support the second setting, we can output some Reach & Frequency statistics to let the Report Collector learn about how many actual users were present in their set of source reports. This will inform them about the expected noise level that was added to the event labels.
+We design two different Event Labeling Queries that can support these settings. In the first setting the source site can (with pretty high confidence) supply the same number of source reports per matchkey. The MPC will label all of the events, scaling the noise by the number of reports per matchkey.
+
+In the second setting a Report Collector will not know in advance how many source reports are for the same matchkey. In this case the RC will specify a cap on the number of source events it wants labeled per matchkey. This allows the noise for each of these events to be scaled by the cap, so the noise level added will be known to the Report Collector, which is useful in debiasing downstream uses of the data. To help support the second setting, we can also output some Reach & Frequency statistics to let the Report Collector learn about how many actual users were present in their set of source reports. This will inform them about how many events exceeded the cap and were labeled as zeros before having the noise applied.
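
The capped setting described above can be sketched in plaintext Python. This is a hedged illustration rather than the MPC implementation. It assumes that the first `cap` source rows per matchkey (in input order) keep their attributed label, which the text does not specify, and that noise is binary randomized response with p = 2 / (1 + e^(query_epsilon / cap)). The function name `capped_noisy_labels` is hypothetical.

```python
import math
import random
from collections import defaultdict


def capped_noisy_labels(rows, query_epsilon, cap, rng):
    """Label source rows with a per-matchkey cap, then add label noise.

    rows: list of (matchkey, source_id, attributed_label) in input order.
    Assumption: the first `cap` rows per matchkey keep their label;
    rows beyond the cap are set to 0 before noise is applied.
    """
    # The noise level depends only on the cap, so it is the same for every row.
    p = 2.0 / (1.0 + math.exp(query_epsilon / cap))
    seen = defaultdict(int)
    out = []
    for matchkey, source_id, label in rows:
        seen[matchkey] += 1
        if seen[matchkey] > cap:
            label = 0  # excess rows beyond the cap are labeled 0
        if rng.random() < p:
            label = rng.randrange(2)  # uniform draw from {0, 1}
        out.append((source_id, label))
    return out


rows = [("K", 1, 1), ("K", 2, 1), ("K", 3, 1), ("K", 4, 1), ("J", 5, 1)]
result = capped_noisy_labels(rows, query_epsilon=1.0, cap=2, rng=random.Random(0))
```

Because p is fixed by the cap rather than by the hidden per-matchkey count, the Report Collector knows the noise level in advance.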


 ## Input/Output Structure
@@ -1952,6 +1952,7 @@ In an Event Labeling Query, a source site submits a source fan-out query where t
 * Source reports (is_trigger, matchkey, timestamp, source_id)
 * Trigger reports (is_trigger, matchkey, timestamp)
 * DP budget for this query: query_epsilon
+* Cap (optional; if no cap is supplied the noise will scale with the number of reports per matchkey)

 **Query Output:**

@@ -1970,15 +1971,19 @@ The stages of the query are as follows where the first two are the same as in re
 * Sort by matchkey in the MPC
 * Attribution
     * Source rows will be labeled as attributed, 1, or not attributed, 0, by some attribution logic that attributes trigger events to earlier source events. It seems we should be able to support any attribution logic here.
-* Label Flipping
-    * First we count the number of source reports that share the same matchkey, call this matchkey_source_count.
-    * For every source row flip its attribution label with probability p, where p is derived from (query_epsilon / matchkey_source_count )
+* Label Flipping; we consider two methods
+    * No Cap Supplied:
+        * First we count the number of source reports that share the same matchkey, call this matchkey_source_count.
+        * For every source row, with probability p we assign its label uniformly at random from {0,1}; otherwise we leave it unchanged. Here p is derived from (query_epsilon / matchkey_source_count) as p = 2 / (2 - 1 + e^(query_epsilon / matchkey_source_count)) = 2 / (1 + e^(query_epsilon / matchkey_source_count))
+    * Cap Supplied:
+        * For source rows with the same matchkey, assign the attributed label to at most cap rows. If there are more than cap source rows for a matchkey, the excess ones are labeled as 0s.
+        * For every source row, with probability p we assign its label uniformly at random from {0,1}; otherwise we leave it unchanged. Here p is derived from (query_epsilon / cap) as p = 2 / (2 - 1 + e^(query_epsilon / cap)) = 2 / (1 + e^(query_epsilon / cap))
 * Output (source_id, label) to the Report Collector


 ## Example

-Consider the following example where we perform last touch attribution.
+Consider the following example where we perform last touch attribution and cap the number of source events labeled per matchkey at 3.



@@ -2007,7 +2012,7 @@ Consider the following example where we perform last touch attribution.
 1
-1 (to be flipped with prob derived from query_epsilon / 1)
+1 (to be flipped with prob derived from query_epsilon / 3)
@@ -2126,10 +2131,16 @@ Consider the following example where we perform last touch attribution.
 In order to bound the amount of information released about a matchkey (which is our best approximation of a user in the system) per epoch we needed to do two things:

 1. Deduct from the per epoch budget for each query
-2. Within each query ensure that all the labeled events released for a matchkey don’t exceed that particular query’s budget. We can do this by letting the probability for each event’s flip be derived from an epsilon of (query epsilon / number of events labeled per matchkey).
+2. Within each query ensure that all the labeled events released for a matchkey don’t exceed that particular query’s budget.
+    * In the case without a cap, we do this by letting the probability for each event’s flip be derived from an epsilon of (query epsilon / number of events labeled per matchkey).
+    * In the case with a cap, we do this by capping the number of events labeled and letting the probability for each event’s flip be derived from an epsilon of (query epsilon / cap).

 A nice property is that we do not need to rate limit report creation on-device or limit replays of reports.

+### Proof for capping case
+Consider two databases D and D' that are adjacent. After rate-limiting to sensitivity, a user's contribution will look like $(x_1, x_2, ..., x_n)$ and $(x_1', x_2', ..., x_n')$ for n events, where the value at any $x_i$ is the label. After capping, these vectors will differ in at most $s$ entries (here $s$ is the cap). Each label is randomized independently, so for an output of $(y_1, ..., y_n)$ the likelihood ratio factorizes, and every factor with $x_i = x'_i$ equals 1: $$\frac{Pr[y_1,...,y_n | x'_1,...,x'_n]}{Pr[y_1,...,y_n | x_1,...,x_n]} = \prod_{i=1}^{n}{\frac{Pr[y_i | x'_i]}{Pr[y_i | x_i]}}
+= \prod_{i \in differing\_rows}{\frac{Pr[y_i | x'_i]}{Pr[y_i | x_i]}}$$
+That is, we can analyze each of the $s$ differing entries independently, and if the likelihood ratio of the mechanism on a single entry is bounded by $e^\epsilon$, then the whole mechanism is $s \epsilon$-differentially private. With $\epsilon$ set to (query epsilon / cap), the total is the query epsilon. (This is with an add-remove notion of DP that will just strip conversions for a converting user or add fake conversions for a non-converting user. With a "replacement" notion of DP that merely modifies the conversion patterns for a user, the rows will differ on $2s$ values and everything goes through with a factor of 2 in the bound.)

 ## Budgeting for both aggregation and event labeling queries
