INDEX
Explanations
instances of phrases that signal personal experiences or confessions
Negative Logits
ingen    -0.18
etail    -0.17
lect     -0.17
erno     -0.17
967      -0.15
ledge    -0.15
åĻ       -0.14
orr      -0.14
ef       -0.14
ides     -0.14
Positive Logits
fairness     0.17
dden         0.16
related      0.16
unrelated    0.15
tiler        0.15
AINS         0.15
stagram      0.15
zcze         0.15
deen         0.14
utra         0.14
Activations Density: 0.108%