Explanations
titles or references to authoritative figures
references to authority figures and their roles, particularly those related to the terms "lord" and "woman"
Negative Logits
¥ŀ            -0.86
unes          -0.66
skelet        -0.65
Citiz         -0.65
Palestin      -0.64
widest        -0.61
itialized     -0.60
lightweight   -0.60
insulation    -0.59
ogi           -0.59
Positive Logits
lord          0.97
hood          0.85
lords         0.81
hyde          0.80
pool          0.76
der           0.76
ëĭ            0.75
hattan        0.75
ifest         0.74
ipop          0.74
Activations Density: 0.022%