INDEX
Explanations
phrases that indicate success or notable events
Negative Logits
Monfieur            -0.84
Theſe               -0.84
myſelf              -0.74
GrantedAuthority    -0.73
Paglinawan          -0.72
Jefus               -0.72
клопе               -0.72
Shakspeare          -0.71
Diſ                 -0.70
iſt                 -0.69
Positive Logits
now                 0.61
now                 0.53
AndEndTag           0.46
Now                 0.44
.                   0.43
<strong>            0.43
Now                 0.42
up                  0.42
p                   0.41
n                   0.41
Activations Density 0.579%