INDEX
Explanations
personal names within a context of narrative or dialogue
words associated with discussions of morality or ethical reasoning
Negative Logits
pestic       -0.88
mathemat     -0.86
horizont     -0.78
iatus        -0.77
raints       -0.76
incorpor     -0.75
myster       -0.75
explan       -0.75
disadvant    -0.74
welf         -0.73
Positive Logits
ï¸ı          1.00
âĹ¼          0.91
ãĥĥãĥī       0.91
rd           0.88
deg          0.86
log          0.86
ĺ            0.84
fter         0.83
hair         0.78
é¾į          0.78
Activations Density 0.035%