INDEX
Explanations
sections of text that contain no activations, indicating a lack of any significant content
Code or math-related characters
legal evidence
Negative Logits
Logit  Token
-1.11  […]
-0.93  …
-0.84
-0.71  <eos>
-0.70  ...
-0.67  ↵↵
-0.65  […]
-0.64
-0.63  .
-0.62  …
Positive Logits
Logit  Token
1.73  Савезне
1.34  pleaſure
1.29  Majefty
1.28  purpoſe
1.26  myſelf
1.26  Мексичка
1.25  Personensuche
1.24  ſelves
1.23  tagHelperRunner
1.23  ſelf
Activations Density: 0.002%