INDEX
Explanations
instances of the word "names."
names or labels
repeated mentions of the word "names."
Negative Logits
yrinth      -0.71
irth        -0.70
Bed         -0.69
OPLE        -0.67
Yar         -0.66
Smy         -0.65
romy        -0.64
UGE         -0.62
Forestry    -0.62
idth        -0.61
Positive Logits
paces       1.58
pace        1.10
names       1.04
plates      1.01
aliases     1.01
paced       0.96
ames        0.96
hips        0.92
akes        0.91
names       0.88
Activations Density 0.015%