INDEX
Explanations
phrases related to instructions or suggestions
Negative Logits
ĸļ          -0.77
supposedly  -0.61
ulner       -0.61
unda        -0.60
ALWAYS      -0.59
Apps        -0.58
Tube        -0.56
jab         -0.56
ruction     -0.56
evidently   -0.55
Positive Logits
someday        1.08
depending      0.78
tempted        0.77
ivably         0.72
slightly       0.72
inadvertently  0.71
xus            0.71
underest       0.70
momentarily    0.67
depending      0.67
Activations Density 0.313%
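
The page does not define the density figure above. A common reading, assumed here, is the fraction of token positions on which the feature fires at all. The sketch below illustrates that reading only; the function name `activation_density`, the zero threshold, and the sample data are hypothetical and not taken from this page.

```python
from typing import Sequence


def activation_density(activations: Sequence[float], threshold: float = 0.0) -> float:
    """Return the fraction of token positions whose activation exceeds `threshold`.

    Assumption: "density" means the share of tokens with a non-zero activation.
    """
    if not activations:
        raise ValueError("need at least one activation value")
    fired = sum(1 for a in activations if a > threshold)
    return fired / len(activations)


# Hypothetical sample: the feature fires on 313 of 100,000 tokens.
sample = [1.0] * 313 + [0.0] * 99_687
density = activation_density(sample)
print(f"{density:.3%}")  # prints 0.313%
```

Under this assumption, a density of 0.313% means the feature is active on roughly 3 tokens in every 1,000.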