Explanations
expressions of emotional avoidance or denial of responsibility
Negative Logits
opia        -0.19
fooled      -0.16
ivor        -0.15
onsense     -0.15
934         -0.15
hin         -0.15
UNCTION     -0.14
ún          -0.14
acam        -0.14
tolerate    -0.14
Positive Logits
alien       0.45
Alien       0.33
alien       0.31
Ali         0.28
antagon     0.26
aliens      0.26
Ali         0.26
risk        0.25
anger       0.24
risk        0.22
Activations Density 0.217%