INDEX
Explanations
phrases indicating strength or certainty
Negative Logits
ly        -0.17
/layouts  -0.16
ettle     -0.15
ooks      -0.15
errupt    -0.14
olut      -0.14
наче      -0.14
аліз      -0.14
нев       -0.14
rij       -0.14
Positive Logits
indeed    0.56
fact      0.48
actually  0.43
inf       0.40
Indeed    0.39
Indeed    0.37
inde      0.37
fact      0.35
totiž     0.35
actually  0.35
Activations Density 0.049%