Negative Logits
Token       Logit
ac          -1.12
ac          -1.05
Ac          -1.02
AC          -0.83
Art         -0.80
Ac          -0.78
(           -0.67
acc         -0.67
original    -0.66
Art         -0.66
Positive Logits
Token       Logit
Houſe       1.59
Theſe       1.52
Diſ         1.46
Jefus       1.45
Efq         1.45
Majefty     1.42
myſelf      1.41
Anſ         1.38
himſelf     1.37
Reſ         1.35
Activations Density: 0.368%