Explanations
terms related to misunderstanding or misrepresentation
Negative Logits
Token    Logit
trap     -0.18
on       -0.17
tant     -0.16
ickey    -0.16
tron     -0.16
c        -0.15
mw       -0.15
t        -0.15
ing      -0.15
trad     -0.15
Positive Logits
Token        Logit
chie         0.27
mis          0.24
emean        0.23
fits         0.22
appropri     0.21
fortune      0.21
ellaneous    0.20
direct       0.20
ubishi       0.20
Mis          0.19
Activations Density: 0.008%