INDEX
Explanations
phrases indicating restraint or holding back
Negative Logits
jar        -0.17
Jar        -0.16
esses      -0.16
jar        -0.15
jars       -0.15
Lama       -0.15
jev        -0.15
ruh        -0.14
FLAGS      -0.14
Jar        -0.14
Positive Logits
Tac        0.19
until      0.17
illac      0.15
Until      0.14
Tac        0.14
tac        0.14
hasta      0.14
ÄĽtÅ¡      0.14
Until      0.14
sidelines  0.14
Activations Density 0.219%