INDEX
Explanations
questions starting with "Who"
instances of the word "Who"
Negative Logits
MER              -0.78
PORT             -0.67
Pilgrim          -0.63
Hyde             -0.62
mun              -0.58
outer            -0.58
compatibility    -0.57
readiness        -0.57
relaxation       -0.57
rog              -0.56
Positive Logits
soever           1.24
ever             1.09
oping            1.05
abouts           1.03
else             0.97
cares            0.96
oped             0.91
knows            0.90
ileaks           0.80
cared            0.79
Activations Density: 0.092%