INDEX
Explanations
mentions of specific textual locations — starting references or origins
Negative Logits
merce        -0.89
faced        -0.83
ratulations  -0.83
rils         -0.75
irm          -0.74
hai          -0.73
busters      -0.73
tailed       -0.70
seek         -0.68
heet         -0.68
Positive Logits
afar         1.72
whence       1.28
scratch      1.18
abroad       1.16
inside       1.01
thence       0.95
within       0.94
somewhere    0.92
anywhere     0.90
outside      0.88
Activations Density 0.546%