Explanations
instances of assertions or claims being made and their reliability
Negative Logits
Token       Logit
<strong>    -0.78
po          -0.76
w           -0.76
老          -0.75
ma          -0.74
ge          -0.71
lot         -0.71
g           -0.71
s           -0.70
e           -0.68
Positive Logits
Token         Logit
myſelf        1.46
itſelf        1.33
purpoſe       1.30
pleaſure      1.22
Monfieur      1.21
ſelf          1.21
reaſon        1.21
themſelves    1.17
obſ           1.17
'\\;'         1.17
Activations Density 0.134%