Explanations
references to individuals involved in controversies or allegations
Negative Logits
envy      -0.64
isters    -0.59
ICLE      -0.58
ificial   -0.58
istg      -0.57
ACTED     -0.54
oleon     -0.54
uador     -0.53
icial     -0.52
pity      -0.51
Positive Logits
chuk      0.78
idge      0.75
gow       0.71
yk        0.70
gat       0.66
bottom    0.66
dal       0.64
kov       0.62
bee       0.62
yi        0.61
Activations Density 0.041%