INDEX
Explanations
words related to falling or decline
Negative Logits
ed       -0.27
ores     -0.24
of       -0.23
o        -0.23
off      -0.23
ovice    -0.22
eer      -0.22
ovich    -0.21
oi       -0.21
oit      -0.20
Positive Logits
llll     0.33
l        0.31
ows      0.29
IGENCE   0.25
eries    0.23
t        0.23
ll       0.23
usions   0.22
mann     0.22
ustr     0.21
Activations Density 0.091%