INDEX
Explanations
expressions of trust or mistrust
NEGATIVE LOGITS
poffible    -1.48
myſelf      -1.46
Diſ         -1.44
Houſe       -1.43
Theſe       -1.42
greateſt    -1.40
ſmall       -1.39
Anſ         -1.38
Reſ         -1.37
houſe       -1.35
POSITIVE LOGITS
trust     0.81
          0.62
<eos>     0.61
and       0.60
          0.60
(         0.57
design    0.57
as        0.56
,         0.56
in        0.55
Activations Density 0.141%