INDEX
Explanations
phrases that indicate attribution or sourcing of information
Negative Logits
ukan        -0.16
(blank)     -0.15
raison      -0.15
leo         -0.15
yb          -0.14
imized      -0.14
iversity    -0.14
urum        -0.14
resse       -0.13
icers       -0.13
Positive Logits
ly          0.28
ately       0.19
LY          0.18
according   0.17
ally        0.17
сь          0.17
ingly       0.15
alf         0.15
ances       0.15
legend      0.15
Activations Density 0.035%