INDEX
Explanations
references to familial relationships and personal history
Negative Logits
lich     -0.18
re       -0.17
ence     -0.17
ul       -0.16
au       -0.15
sh       -0.15
elic     -0.15
enc      -0.14
lic      -0.14
liche    -0.14
Positive Logits
uer      0.26
cken     0.22
iß       0.21
iste     0.21
ß        0.21
ifen     0.21
inen     0.20
ilen     0.20
ibe      0.20
ines     0.20
Activations Density 0.028%