Explanations
questions and concerns about technological implementation and safety
Negative Logits
Token       Logit
Efq         -1.46
Jefus       -1.45
myſelf      -1.35
Monfieur    -1.30
Eſ          -1.26
ſelf        -1.25
$_"         -1.24
chofe       -1.22
Anſ         -1.22
Reſ         -1.21
Positive Logits
Token       Logit
…           1.11
…           1.04
<eos>       1.04
...         1.01
...         0.98
</i>        0.95
….          0.88
....        0.87
……          0.87
[…]         0.86
Activations Density 0.127%
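
For readers unfamiliar with the density figure, the sketch below shows one common way such a number is obtained: the fraction of token positions on which the feature's activation is above zero. This is an assumption about the convention, not something the page states; the function name `activation_density`, the `threshold` parameter, and the toy data are all hypothetical.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions where the activation exceeds threshold.

    Assumed convention: a feature counts as "active" on a token when its
    activation is strictly greater than `threshold` (0.0 by default).
    """
    acts = list(activations)
    if not acts:
        raise ValueError("activations must be non-empty")
    fired = sum(1 for a in acts if a > threshold)
    return fired / len(acts)


# Toy data (hypothetical): 10,000 token positions, the feature fires on 13.
toy_activations = [1.5] * 13 + [0.0] * 9_987
density = activation_density(toy_activations)
print(f"{density:.3%}")  # 0.130%
```

Under this convention, a density of 0.127% would mean the feature is active on roughly 13 of every 10,000 tokens sampled.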