INDEX
Explanations
mentions of the word "Stone" with varying activations
instances of the word "Stone."
Negative Logits
oresc      -0.79
merce      -0.77
olulu      -0.76
ornia      -0.75
orescence  -0.74
ntil       -0.72
unal       -0.70
unct       -0.69
ulate      -0.68
ership     -0.67
Positive Logits
falls      0.91
hill       0.88
works      0.83
lings      0.83
Cold       0.83
ring       0.81
hook       0.81
asure      0.81
asures     0.80
Age        0.79
Activations Density: 0.019%