Explanations
expressions of dissent and criticism toward authority or proposals
Negative Logits
sounds  -0.15
mare  -0.14
uter  -0.14
etus  -0.13
inand  -0.13
iams  -0.13
nổi  -0.13
interess  -0.13
isi  -0.13
ansen  -0.13
Positive Logits
perceived  0.32
lack  0.28
lack  0.23
practices  0.22
decision  0.22
decisions  0.22
handling  0.21
alleged  0.21
way  0.20
treatment  0.20
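Lists like the two above are commonly produced by projecting a feature's decoder direction through the model's unembedding matrix and keeping the k highest and k lowest scoring tokens. The page itself does not state its method, so the sketch below is an assumption: `top_logits`, the row/column layout of `unembed`, and the toy vocabulary are all hypothetical and only illustrate the ranking step.

```python
from typing import List, Tuple

def top_logits(
    decoder_dir: List[float],
    unembed: List[List[float]],
    vocab: List[str],
    k: int,
) -> Tuple[List[Tuple[str, float]], List[Tuple[str, float]]]:
    """Hypothetical helper: rank tokens by decoder_dir @ unembed.

    Assumes `unembed` has one row per model dimension and one column
    per vocabulary entry (a d_model x n_vocab layout).
    """
    d_model = len(decoder_dir)
    # Dot product of the feature direction with each token's unembedding column.
    scores = [
        sum(decoder_dir[i] * unembed[i][j] for i in range(d_model))
        for j in range(len(vocab))
    ]
    # Sort ascending by score: the front holds the most negative tokens.
    ranked = sorted(zip(vocab, scores), key=lambda pair: pair[1])
    negative = ranked[:k]
    positive = ranked[::-1][:k]
    return positive, negative

# Toy example with a 2-dimensional "model" and a 3-token vocabulary.
pos, neg = top_logits(
    decoder_dir=[1.0, 0.5],
    unembed=[[0.2, -0.1, 0.0], [0.4, 0.0, -0.3]],
    vocab=["perceived", "sounds", "mare"],
    k=1,
)
print(pos, neg)
```

Under these assumptions, a larger positive score means the feature, when active, pushes the model toward predicting that token, and a negative score pushes away from it.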
Activation density: 0.253%
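Activation density is usually read as the share of sampled token positions on which the feature fires; on that reading, 0.253% means roughly 253 active positions per 100,000 tokens. The sketch below assumes "fires" means an activation strictly greater than zero; the helper name `activation_density` and the sample data are hypothetical.

```python
from typing import Sequence

def activation_density(activations: Sequence[float]) -> float:
    """Hypothetical helper: fraction of positions with activation > 0."""
    if not activations:
        raise ValueError("need at least one activation value")
    active = sum(1 for a in activations if a > 0)
    return active / len(activations)

# Toy sample: 3 active positions out of 1,000.
sample = [0.0] * 997 + [1.2, 0.3, 0.5]
density = activation_density(sample)
print(f"{density:.3%}")
```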