Explanations
sections of text that contain the character "-" followed by a non-zero activation
Negative Logits
Token              Logit
sizeCache          -1.00
démocr             -0.92
$")                -0.86
)");               -0.86
ModelExpression    -0.84
Efq                -0.84
Савезне            -0.83
étoit              -0.83
étoient            -0.83
$_"                -0.83
Positive Logits
Token              Logit
-                  0.64
(                  0.59
*                  0.52
(blank)            0.52
↵                  0.51
–                  0.49
(blank)            0.49
_                  0.48
GenerationType     0.47
<eos>              0.46
Activations Density 0.216%
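
The density figure above is presumably the share of sampled token positions on which this feature has a non-zero activation; the page itself does not define it, so treat that reading as an assumption. Under that assumption, a minimal sketch of the calculation follows. The function name `activation_density` and the toy activation values are hypothetical and are not taken from the dashboard.

```python
def activation_density(activations):
    """Return the fraction of token positions with a non-zero activation.

    Assumption: "density" means (non-zero positions) / (all positions).
    """
    if not activations:
        return 0.0
    nonzero = sum(1 for a in activations if a != 0)
    return nonzero / len(activations)


# Hypothetical toy data: 1000 token positions, 2 of which fire.
acts = [0.0] * 998 + [1.3, 0.4]
density = activation_density(acts)
print(f"{density:.3%}")  # formatted as a percentage, like the dashboard value
```

Read this way, a density of 0.216% would mean the feature fires on roughly 2 out of every 1000 sampled tokens.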