INDEX
Explanations
commands or instructions
Negative Logits
accompanies   -0.79
livest        -0.69
constitu      -0.68
advertised    -0.66
agre          -0.64
idding        -0.61
nesota        -0.59
nar           -0.59
coerc         -0.58
dissatisf     -0.57
Positive Logits
aways         1.26
advantage     1.10
away          0.94
heed          0.93
uchi          0.91
aback         0.89
care          0.84
overs         0.82
prising       0.80
frey          0.75
Activations Density: 0.042%