INDEX
Explanations
words that convey evaluations or judgments about people or situations
Negative Logits
Token     Logit
oe        -0.16
892       -0.15
ayo       -0.14
urtle     -0.14
awan      -0.14
rij       -0.14
hoping    -0.13
Gig       -0.13
xab       -0.13
odied     -0.13
Positive Logits
Token       Logit
result      0.29
due         0.29
due         0.28
based       0.26
resultado   0.25
_due        0.24
thanks      0.23
resultat    0.23
driven      0.23
thanks      0.23
Activations Density 0.012%
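
For readers who want to relate the density figure to raw activations, here is a minimal sketch. It assumes density means the percentage of tokens on which the feature's activation exceeds a threshold (zero by default); the dashboard itself does not state its definition. The function name `activation_density` and the toy data are hypothetical and chosen only for illustration.

```python
def activation_density(activations, threshold=0.0):
    """Percentage of tokens whose activation exceeds `threshold`.

    Assumption: "density" is the share of active tokens. The dashboard
    does not document its exact definition.
    """
    if not activations:
        raise ValueError("need at least one activation value")
    active = sum(1 for a in activations if a > threshold)
    return 100.0 * active / len(activations)


# Hypothetical toy data: 3 active tokens out of 25,000.
toy = [0.0] * 24_997 + [0.7, 1.2, 0.3]
print(f"{activation_density(toy):.3f}%")  # prints "0.012%"
```

Under this reading, a density of 0.012% corresponds to roughly 12 active tokens per 100,000.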