Explanations
phrases indicating identity, emotional reactions, or situational contexts
Negative Logits
          -0.41
are       -0.40
P         -0.39
</h2>     -0.38
bad       -0.38
s         -0.38
sur       -0.38
du        -0.38
-         -0.37
Paul      -0.37
Positive Logits
propOrder   1.28
myſelf      1.19
متعلقه      1.16
Monfieur    1.00
ſelf        0.99
Jefus       0.95
Efq         0.94
itſelf      0.94
ſelves      0.94
ſche        0.94
Activations Density 0.082%