INDEX
Explanations
words related to pills or medications
Negative Logits
ed       -0.25
eties    -0.20
eme      -0.19
emu      -0.18
eer      -0.18
etes     -0.18
yne      -0.18
ely      -0.18
emed     -0.17
emp      -0.17
Positive Logits
iard     0.32
owy      0.28
ings     0.27
umin     0.26
iams     0.26
inois    0.25
iterate  0.24
l        0.24
ard      0.24
ows      0.24
Activations Density 0.080%