Explanations
names or initials of individuals
Negative Logits
Token    Logit
Eſ       -0.99
Beſ      -0.94
Anſ      -0.92
Theſe    -0.91
Conſ     -0.91
againſt  -0.87
#+#      -0.87
Reſ      -0.87
itſelf   -0.86
Perſ     -0.85
Positive Logits
Token  Logit
J      0.88
C      0.82
K      0.81
W      0.81
O      0.80
D      0.78
A      0.78
M      0.77
B      0.77
G      0.76
Activations Density: 0.152%