INDEX
Explanations
mentions of a specific keyword "Hel" with varying emphasis indicated by different activation values
references to health-related topics and organizations
Negative Logits
Token       Logit
Saber       -0.70
Eag         -0.67
selves      -0.63
negatives   -0.63
surrogate   -0.60
Memories    -0.58
Lowell      -0.58
EED         -0.57
rers        -0.57
æī          -0.55
Positive Logits
Token       Logit
pless       1.24
mut         1.20
ms          1.17
ped         1.09
ping        1.07
mand        1.04
ios         1.02
iop         0.99
met         0.98
ps          0.96
Activations Density 0.038%
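
The page does not define how the density figure is computed. A minimal sketch, assuming it is the fraction of sampled token positions on which the feature's activation is above zero; the function name `activation_density`, the `threshold` parameter, and the sample data are hypothetical and not taken from the dashboard:

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of activations strictly above `threshold`.

    Assumption: "density" means the share of token positions where the
    feature fires at all. The source page does not confirm this definition.
    """
    if not activations:
        return 0.0
    firing = sum(1 for a in activations if a > threshold)
    return firing / len(activations)


# Hypothetical data: 10,000 token positions, 4 of which have a nonzero activation.
sample = [0.0] * 9996 + [1.3, 0.2, 2.7, 0.9]
density = activation_density(sample)
print(f"{density:.3%}")  # prints 0.040%
```

Under this reading, a density of 0.038% would mean the feature fires on roughly 38 of every 100,000 sampled token positions.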