INDEX
Explanations
references to authority figures or influential individuals
Negative Logits
cco        -0.16
ouz        -0.15
oard       -0.15
atrice     -0.15
ombok      -0.15
arie       -0.14
ÏĤ         -0.14
fix        -0.14
allah      -0.14
Brilliant  -0.14
Positive Logits
sturdy           0.20
functionalities  0.18
men              0.18
paced            0.16
invalid          0.16
Scotch           0.15
turbulent        0.15
honest           0.15
gentle           0.15
ye               0.15
Activations Density: 0.395%