INDEX
Explanations
phrases related to physical actions or descriptions
pronouns related to people
Negative Logits
Gerard      -0.69
Mit         -0.67
Majority    -0.66
Emer        -0.65
Cliff       -0.64
Pompe       -0.64
Beacon      -0.64
Junk        -0.64
Anat        -0.63
Petraeus    -0.63
Positive Logits
arers       0.98
pton        0.91
ÃĥÃĤ        0.91
ared        0.90
've         0.90
arer        0.89
til         0.88
'll         0.86
hots        0.85
ldon        0.84
Activations Density 0.305%