INDEX
Explanations
phrases related to public opinions and reactions
discussions related to societal issues and perceptions of privilege
Negative Logits
nonetheless      -0.78
etheless         -0.75
utsche           -0.71
acknowledgment   -0.69
acknowledgement  -0.68
acknowledges     -0.67
»Ĵ               -0.64
Regions          -0.64
VICE             -0.62
Ezek             -0.61
Positive Logits
unbeat           0.90
boring           0.88
\"               0.86
inferior         0.82
crazy            0.80
invincible       0.79
unfairly         0.79
retarded         0.78
underrated       0.78
bad              0.76
Activations Density: 0.618%