Explanations
statements expressing support or validation
phrases related to support or validation
Negative Logits
nel          -0.75
anny         -0.73
ities        -0.70
newsp        -0.69
pox          -0.68
inational    -0.65
Hebdo        -0.65
irony        -0.64
entric       -0.63
ILCS         -0.63
Positive Logits
raise        0.75
hard         0.75
track        0.73
byn          0.73
abies        0.71
drive        0.69
ament        0.67
taking       0.66
GROUND       0.65
lash         0.65
Activations Density: 0.038%