Explanations
mentions of authority figures and their actions
Negative Logits
iggins    -0.15
ì§ģ       -0.15
æŀ        -0.15
ł         -0.14
ttl       -0.14
asca      -0.14
brtc      -0.14
chw       -0.13
stalk     -0.13
iera      -0.13
Positive Logits
701       0.17
Fol       0.17
Tro       0.17
Bates     0.16
971       0.15
tro       0.14
Perm      0.14
arious    0.14
ata       0.14
torture   0.14
Activations Density: 0.042%