INDEX
Explanations
actions or tasks being done
instances of the phrase "do it."
Negative Logits
Opposition   -0.68
Flavoring    -0.67
Returning    -0.63
²            -0.60
Represent    -0.60
Ware         -0.58
opinions     -0.58
advisors     -0.57
ONSORED      -0.56
Arm          -0.56
Positive Logits
alian        1.09
justice      0.83
alia         0.81
chy          0.81
wrong        0.81
pez          0.79
self         0.79
lez          0.77
differently  0.76
unes         0.76
Activations Density: 0.049%