INDEX
Explanations
mentions of specific individuals, particularly those named Robert and Richard
Negative Logits
Token    Logit
edis     -0.15
zung     -0.15
amac     -0.15
imore    -0.15
ARTH     -0.14
ogan     -0.14
729      -0.14
ansas    -0.14
ogui     -0.13
oble     -0.13
Positive Logits
Token        Logit
Predicate    0.14
Bridges      0.14
alian        0.14
iese         0.14
alem         0.14
langs        0.13
aze          0.13
alf          0.13
ulton        0.13
ec           0.13
Activations Density: 0.031%