INDEX
Explanations
prepositional phrases indicating a specific type of action or behavior
phrases that express various types of attention or critique
Negative Logits
Pigs     -0.75
Crush    -0.72
gow      -0.70
pots     -0.66
Sands    -0.65
Rocks    -0.64
gor      -0.64
APS      -0.64
mates    -0.62
ours     -0.61
Positive Logits
thing       1.02
scenario    0.81
behavior    0.75
stuff       0.73
nonsense    0.72
activity    0.71
attrition   0.70
kindred     0.69
tnc         0.69
fate        0.68
Activations Density: 0.033%