INDEX
Explanations
information related to political figures and their statements
Negative Logits
selves        -0.79
selves        -0.69
theirs        -0.67
Parenthood    -0.65
animate       -0.65
decay         -0.65
destruct      -0.63
Reviewer      -0.62
inferior      -0.61
Daddy         -0.61
Positive Logits
himself       0.93
quoted        0.92
referring     0.92
speaking      0.84
interviewed   0.83
overseeing    0.83
cited         0.82
firsthand     0.82
recommending  0.81
personally    0.81
Activations Density 0.580%