Explanations
references to metaphorical language and concepts
Negative Logits
Token      Logit
lator      -0.16
ervals     -0.15
oll        -0.15
.rf        -0.14
brush      -0.14
ering      -0.14
san        -0.14
iba        -0.14
rado       -0.13
ámara      -0.13
Positive Logits
Token      Logit
-Clause     0.15
quo         0.14
iminal      0.14
PerPixel    0.14
ichel       0.14
_blocked    0.14
celik       0.13
AXB         0.13
Sof         0.13
quette      0.13
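The logit lists rank vocabulary tokens by how strongly the feature pushes their output logits up or down. The sketch below assumes the common dashboard approach of scoring each token by the dot product of the feature's decoder direction with the corresponding column of the model's unembedding matrix; the page itself does not state its method, and the function names and toy numbers here are hypothetical.

```python
# Hypothetical sketch, assuming token scores = decoder direction · unembedding column.

def logit_effects(decoder_vec, unembed, vocab):
    """Return (token, score) pairs for every vocabulary entry.

    decoder_vec: list[float] of length d_model (the feature's output direction).
    unembed: d_model rows, each a list[float] with one entry per vocab token.
    vocab: list[str] of token strings.
    """
    scores = []
    for j, token in enumerate(vocab):
        # Dot product of the decoder direction with column j of the unembedding.
        s = sum(decoder_vec[i] * unembed[i][j] for i in range(len(decoder_vec)))
        scores.append((token, s))
    return scores


def top_and_bottom(scores, k):
    """Split scores into the k most negative and the k most positive tokens."""
    ranked = sorted(scores, key=lambda pair: pair[1])
    negative = ranked[:k]        # most negative first
    positive = ranked[::-1][:k]  # most positive first
    return negative, positive


# Toy example with a 2-dimensional model and a 3-token vocabulary.
vocab = ["a", "b", "c"]
decoder_vec = [1.0, -0.5]
unembed = [[0.2, -0.1, 0.0],
           [0.0, 0.2, -0.6]]
negative, positive = top_and_bottom(logit_effects(decoder_vec, unembed, vocab), 1)
print(negative, positive)
```

Here token "b" gets the most negative score (about -0.2) and "c" the most positive (about 0.3), mirroring how the two lists above are read.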
Activation Density: 0.023%
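Activation density is read here as the fraction of token positions on which the feature fires. Assuming "fires" means an activation strictly above zero (the page does not give its threshold), a density of 0.023% corresponds to roughly 23 active tokens per 100,000. A minimal sketch with a hypothetical helper:

```python
# Hypothetical helper, assuming a token counts as active when activation > threshold.

def activation_density(activations, threshold=0.0):
    """Fraction of token positions where the feature activation exceeds threshold."""
    if not activations:
        raise ValueError("need at least one activation")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# 23 active positions out of 100,000 tokens.
acts = [0.0] * 99_977 + [1.5] * 23
density = activation_density(acts)
print(f"{density:.3%}")
```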