Explanations
words related to dehumanization and its effects
Negative Logits
Token     Logit
yt        -0.17
ames      -0.16
itsu      -0.15
YT        -0.15
wei       -0.15
дов       -0.14
lian      -0.14
anford    -0.14
reme      -0.14
ittle     -0.14
Positive Logits
Token     Logit
facto     0.18
urgeon    0.15
rig       0.15
de        0.15
æľĭ       0.15
Decomp    0.15
æŃ        0.14
ัà¸ģà¸Ĺ    0.14
ognito    0.14
eam       0.14
Activations Density: 0.040%