Explanations
phrases indicating existence or presence
Negative Logits
rna     -0.15
ami     -0.15
oya     -0.15
ıi      -0.14
bor     -0.14
gend    -0.14
ue      -0.14
é«      -0.14
å°ĸ     -0.13
Til     -0.13
Positive Logits
lies    0.21
lie     0.18
reich   0.18
olit    0.15
Lies    0.15
lying   0.15
yonel   0.14
.ERR    0.14
'aff    0.14
yt      0.14
Activations Density: 0.060%