INDEX
Explanations
examples or instances of different concepts or situations
phrases that signify examples or instances
Negative Logits
Token       Logit
izons       -0.85
ulum        -0.74
lene        -0.74
ossier      -0.69
earchers    -0.69
houses      -0.68
cles        -0.67
culosis     -0.67
hya         -0.66
Tours       -0.66
Positive Logits
Token           Logit
collateral      0.89
unintended      0.84
heroism         0.83
how             0.76
plagiar         0.76
constructive    0.75
blatant         0.73
hypocrisy       0.72
spontaneous     0.71
divergence      0.71
Activations Density: 0.116%