INDEX
Explanations
No Explanations Found
Negative Logits
fucked    -0.23
fuck      -0.21
shitty    -0.21
Fuck      -0.20
FUCK      -0.20
fucking   -0.19
Fucking   -0.19
fuck      -0.18
shit      -0.17
Fuck      -0.17
Positive Logits
krom            0.19
conservatism    0.16
ač              0.15
.identity       0.15
simply          0.15
ости            0.15
chl             0.14
understood      0.14
conservatives   0.14
igon            0.14
Activations Density 0.000%
No Known Activations
This feature has no known activations.