INDEX
Explanations
No Explanations Found
Negative Logits
Seym       -0.85
swick      -0.80
ilib       -0.79
Volunte    -0.77
cair       -0.75
KER        -0.70
mercial    -0.69
pron       -0.68
Correct    -0.67
DOC        -0.64
Positive Logits
thora      0.74
harass     0.73
frank      0.71
grav       0.69
riz        0.69
masturb    0.67
rampage    0.67
morale     0.65
inciner    0.65
wrath      0.65
Activations Density: 0.000%
No Known Activations
This feature has no known activations.