INDEX
Explanations
mentions of actions or instructions
Negative Logits
ndra          -0.85
ambo          -0.82
tions         -0.80
-+-+          -0.77
ntil          -0.70
Ü            -0.68
nell          -0.65
otten         -0.64
aza           -0.64
ategories     -0.64
Positive Logits
stride        1.02
cues          1.02
seriously     1.00
plunge        0.97
reins         0.96
cue           0.95
liberties     0.92
virginity     0.88
aback         0.86
lightly       0.84
Activations Density: 1.518%