Explanations
No Explanations Found
Negative Logits
Token      Logit
-          -0.29
--         -0.27
--         -0.22
---        -0.19
...↵↵      -0.19
...        -0.19
...↵       -0.16
:          -0.16
ighbor     -0.16
...        -0.16
Positive Logits
Token      Logit
fucking    0.22
fuck       0.22
fucks      0.22
fucked     0.21
Fuck       0.20
Fuck       0.18
fuck       0.18
FUCK       0.18
ิบ         0.17
Streaming  0.17
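The page does not say how these logit values were produced. Dashboards of this kind commonly take the feature's decoder direction and project it through the model's unembedding matrix, then list the tokens with the largest and smallest results. The following is a minimal sketch under that assumption only; the names logit_effects, top_tokens, decoder_dir and w_u are hypothetical, and the toy weights are made up for illustration.

```python
from typing import List, Sequence, Tuple


def logit_effects(decoder_dir: Sequence[float],
                  w_u: Sequence[Sequence[float]]) -> List[float]:
    # Assumed definition: direct effect on each output token = decoder_dir @ W_U.
    # w_u is stored as d_model rows, each with vocab_size columns.
    vocab_size = len(w_u[0])
    return [sum(decoder_dir[i] * w_u[i][t] for i in range(len(decoder_dir)))
            for t in range(vocab_size)]


def top_tokens(effects: Sequence[float], vocab: Sequence[str], k: int,
               positive: bool = True) -> List[Tuple[str, float]]:
    # Sort token ids by effect: descending for the positive list,
    # ascending for the negative list, then keep the first k.
    order = sorted(range(len(effects)), key=lambda t: effects[t],
                   reverse=positive)
    return [(vocab[t], effects[t]) for t in order[:k]]


# Toy example: d_model = 2, four-token vocabulary, made-up weights.
vocab = ["fuck", "...", "Streaming", "--"]
decoder_dir = [1.0, 0.5]
w_u = [[0.2, -0.1, 0.1, -0.2],
       [0.04, -0.2, 0.1, 0.1]]

effects = logit_effects(decoder_dir, w_u)
print(top_tokens(effects, vocab, k=2, positive=True))
print(top_tokens(effects, vocab, k=2, positive=False))
```

With these toy numbers the positive list ranks "fuck" first and "Streaming" second, and the negative list ranks "..." first and "--" second. A real model would also involve details this sketch ignores, such as any final layer norm before the unembedding.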
Activation Density: 0.000%
This feature has no known activations.
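Activation density is usually read as the share of sampled token positions on which the feature fires. The sketch below assumes that definition and a firing threshold of zero; the function names and the threshold parameter are hypothetical, not taken from the page.

```python
from typing import Sequence


def activation_density(activations: Sequence[float],
                       threshold: float = 0.0) -> float:
    # Percentage of token positions whose activation exceeds the threshold.
    if not activations:
        return 0.0
    firing = sum(1 for a in activations if a > threshold)
    return 100.0 * firing / len(activations)


def format_density(density: float) -> str:
    # Three decimal places, matching the "0.000%" style used above.
    return f"{density:.3f}%"


# A feature that never fires over the sampled tokens.
print(format_density(activation_density([0.0] * 1000)))  # 0.000%
print(format_density(activation_density([0.0, 0.0, 1.5, 0.0])))  # 25.000%
```

At three decimal places any density below 0.0005% also prints as "0.000%", so the displayed figure alone cannot distinguish "never fires" from "fires very rarely"; here the accompanying note that no activations are known settles it.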