Explanations
strong statements opposing violence and advocating for human rights
Negative Logits
Token        Logit
inding       -0.14
quadr        -0.14
really       -0.14
æĿİ          -0.14
inda         -0.14
exactly      -0.14
hardly       -0.14
498          -0.13
uben         -0.13
Tap          -0.13
Positive Logits
Token        Logit
tolerate     0.25
toler        0.24
acceptable   0.24
tolerated    0.23
_tolerance   0.22
tolerance    0.21
Accept       0.21
tol          0.20
Accept       0.20
ACCEPT       0.20
Activations Density: 0.208%