INDEX
Explanations
code, actions, and attention
NEGATIVE LOGITS
2            0.94
Process      0.90
4            0.90
1            0.84
Hundreds     0.81
PI           0.79
8            0.79
9            0.79
Constructed  0.78
7            0.77
POSITIVE LOGITS
ominous      0.91
juegos       0.87
먼           0.87
agrav        0.87
financiera   0.84
juegan       0.84
ны           0.82
smoky        0.82
viajes       0.82
horned       0.80
Activations Density: 0.001%