INDEX
Explanations
academic titles and professions
references to professors and doctors
Negative Logits
Token        Logit
OPA          -0.75
routed       -0.73
takeoff      -0.73
destro       -0.69
pony         -0.68
tumblr       -0.68
sled         -0.68
actionGroup  -0.67
tumble       -0.67
spir         -0.66
Positive Logits
Token        Logit
essors       1.08
Joseph       0.84
Jonathan     0.79
agher        0.79
Peter        0.78
Gabriel      0.78
Jorge        0.78
Daniel       0.78
Timothy      0.77
emer         0.77
Activations Density: 0.060%