Explanations
No Explanations Found
Negative Logits
insula         -0.74
jer            -0.71
sed            -0.70
Jer            -0.67
abol           -0.67
pn             -0.67
olor           -0.66
TOR            -0.65
ratulations    -0.64
arkin          -0.64
Positive Logits
SX             0.68
demos          0.68
orce           0.67
demo           0.65
agre           0.64
ufact          0.64
oti            0.63
fell           0.62
VG             0.62
Wim            0.61
Activations Density 0.000%
No Known Activations
This feature has no known activations.