INDEX
Explanations
adjectives or verbs indicating a change in ease or difficulty of a certain action
phrases indicating the relative difficulty or ease of various actions or regulations
Negative Logits
notations    -0.65
Originally   -0.63
ilo          -0.63
leigh        -0.60
agraph       -0.60
milo         -0.60
Blu          -0.58
Difference   -0.57
wings        -0.56
kind         -0.56
Positive Logits
for          0.79
to           0.78
punishable   0.71
IBLE         0.67
ible         0.65
enged        0.64
prey         0.63
for          0.63
easier       0.62
anced        0.62
Activations Density: 0.076%