Explanations
instances where actions are being performed or suggested
the action of taking or similar verb forms related to actions performed
Negative Logits
Cong      -0.68
mith      -0.65
idding    -0.65
Seg       -0.62
linked    -0.60
Develop   -0.60
illing    -0.59
eman      -0.59
eers      -0.59
gian      -0.58
Positive Logits
advantage     1.23
precautions   1.09
care          1.07
refuge        1.05
baths         1.04
aways         1.03
selfies       1.00
liberties     0.98
aback         0.95
shortcuts     0.94
Activations Density: 0.112%