INDEX
Explanations
names and titles related to historical or cultural figures
Negative Logits
ModuleName    -0.15
.oracle       -0.15
tre           -0.15
.selector     -0.15
amak          -0.15
REP           -0.14
tre           -0.14
ALER          -0.14
xac           -0.14
arih          -0.14
Positive Logits
chen          0.17
Gi            0.15
Ned           0.14
SPDX          0.14
XD            0.14
egend         0.14
ismus         0.14
Lud           0.13
lei           0.13
xford         0.13
Activations Density 0.091%