Explanation: phrases indicating requirement or necessity
Negative Logits
Token      Logit
americ     -0.74
diction    -0.68
cart       -0.67
oci        -0.65
Liter      -0.65
cript      -0.65
concess    -0.64
laughter   -0.62
"""        -0.61
uras       -0.61
Positive Logits
Token      Logit
prove       1.00
regain      0.94
overcome    0.93
patience    0.91
convince    0.90
hurry       0.86
rematch     0.86
retake      0.84
succeed     0.80
earn        0.80
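The page does not say how these logit scores are produced. One common approach for sparse-autoencoder features, assumed here and not confirmed by the page, is to project the feature's decoder direction through the model's unembedding matrix and list the highest- and lowest-scoring vocabulary tokens. The sketch below uses plain Python lists. The names `decoder_dir`, `w_u`, `logit_effects`, and `top_and_bottom` are hypothetical. The sketch also does not reproduce any scaling the dashboard may apply to its values.

```python
# Hypothetical sketch (assumed method, not confirmed by the source): score every
# vocabulary token by projecting a feature's decoder direction through the
# unembedding matrix, then pick the most positive and most negative tokens.

def logit_effects(decoder_dir, w_u):
    """Return decoder_dir @ w_u: one score per vocabulary token.

    decoder_dir: list of d_model floats (hypothetical feature direction).
    w_u: d_model rows x vocab_size columns (hypothetical unembedding matrix).
    """
    vocab_size = len(w_u[0])
    return [
        sum(decoder_dir[i] * w_u[i][j] for i in range(len(decoder_dir)))
        for j in range(vocab_size)
    ]


def top_and_bottom(scores, tokens, k):
    """Return the k highest and k lowest (token, score) pairs."""
    pairs = list(zip(tokens, scores))
    positive = sorted(pairs, key=lambda p: p[1], reverse=True)[:k]
    negative = sorted(pairs, key=lambda p: p[1])[:k]
    return positive, negative


# Toy example with made-up numbers (not the dashboard's data).
tokens = ["prove", "earn", "cart", "laughter"]
scores = logit_effects([1.0, 0.5], [[0.6, 0.4, -0.3, -0.2],
                                    [0.8, 0.8, -0.8, -0.8]])
positive, negative = top_and_bottom(scores, tokens, 2)
print(positive)  # "prove" and "earn" score highest
print(negative)  # "cart" and "laughter" score lowest
```

Under this reading, a large positive score means that adding the feature's direction to the residual stream directly raises that token's logit. It says nothing about downstream layers.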
Activation density: 0.111%
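The page does not define activation density. A common convention, assumed here, is the percentage of sampled tokens on which the feature's activation is above zero. Under that reading, 0.111% means the feature fires on roughly 111 of every 100,000 tokens. The function name `activation_density` and its `threshold` parameter are hypothetical.

```python
# Hypothetical sketch (assumed definition): activation density as the
# percentage of sampled token activations that exceed a threshold (default 0).

def activation_density(activations, threshold=0.0):
    """Return the percentage of entries in `activations` above `threshold`."""
    if not activations:
        raise ValueError("need at least one activation")
    firing = sum(1 for a in activations if a > threshold)
    return 100.0 * firing / len(activations)


# Toy data: 111 firing tokens out of 100,000 sampled tokens.
acts = [0.0] * 99_889 + [1.5] * 111
print(activation_density(acts))  # about 0.111 (percent)
```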