Explanations
No Explanations Found
Negative Logits
shit          -0.76
aroo          -0.73
antic         -0.71
Shit          -0.68
swast         -0.68
Donetsk       -0.67
fres          -0.66
insensitive   -0.65
letter        -0.65
****          -0.64
Positive Logits
JS            0.84
eq            0.80
PE            0.78
abeth         0.76
NL            0.76
ESH           0.75
eph           0.74
VERT          0.73
PU            0.73
Noir          0.73
Activations Density 0.000%
This feature has no known activations.