INDEX
Explanations
phrases indicating representation or significance
phrases that indicate representation or significance
Negative Logits
jar       -0.64
intend    -0.61
ita       -0.61
roit      -0.60
imb       -0.60
ibel      -0.60
erm       -0.60
erman     -0.60
behaved   -0.60
ithing    -0.60
Positive Logits
Interstitial   0.83
an             0.77
a              0.73
something      0.70
orically       0.68
reement        0.67
Operation      0.66
sacrifices     0.66
salvation      0.66
agi            0.66
Activations Density: 0.072%