INDEX
Explanations
instances of strong negative emotions or harsh language
Negative Logits
)|^{          -0.69
GIP           -0.63
abestanden    -0.61
—             -0.61
eradish       -0.61
&___          -0.59
Sanger        -0.59
зульта        -0.59
snippetHide   -0.58
quí           -0.58
Positive Logits
ine             0.74
SourceChecksum  0.66
ویکیپدی         0.64
boarding        0.57
master          0.55
quedarse        0.55
baga            0.54
belast          0.54
+#+#            0.54
mm              0.53
Activations Density: 0.068%