INDEX
Explanations
No Explanations Found
Negative Logits
halla           -0.80
glim            -0.79
pse             -0.75
volunt          -0.74
advoc           -0.74
enthusi         -0.74
blat            -0.71
unlaw           -0.69
manslaughter    -0.68
jri             -0.68
Positive Logits
favorite        0.90
favorites       0.86
Favorite        0.78
favorite        0.74
fw              0.69
MAC             0.69
Cat             0.65
ãĥį             0.63
é»Ĵ             0.62
[_              0.62
Activations Density 0.000%
No Known Activations
This feature has no known activations.