Explanations
specific references or citations in a structured format
Negative Logits
Token     Logit
C         -0.65
in        -0.61
di        -0.57
S         -0.55
L         -0.55
to        -0.54
          -0.53
A         -0.52
Di        -0.51
IN        -0.51
Positive Logits
Token     Logit
myſelf    1.33
ſelf      1.31
itſelf    1.30
Majefty   1.29
Monfieur  1.21
pleaſure  1.19
ſever     1.17
ſeveral   1.17
Efq       1.16
reaſon    1.14
Activations Density 0.887%