INDEX
Explanations
words related to safety, protection, or refuge
the word "haven" in various contexts
Negative Logits
upp          -0.64
otype        -0.63
ractical     -0.61
onel         -0.61
activation   -0.60
ahon         -0.60
thickness    -0.60
othy         -0.60
ulu          -0.59
rophe        -0.59
Positive Logits
't           0.96
geon         0.87
cheon        0.83
ned          0.80
gotten       0.79
tyard        0.79
itals        0.79
ajor         0.77
itarian      0.75
edin         0.74
Activations Density 0.024%