INDEX
Explanations
phrases related to contradictions or conflicts
terms related to contradiction or conflict
Negative Logits
NetMessage    -0.77
throats       -0.72
istics        -0.70
eele          -0.69
Mehran        -0.68
GOODMAN       -0.65
Nanto         -0.65
Assass        -0.64
doms          -0.64
Parenthood    -0.63
Positive Logits
ptions        1.10
contra        1.06
ption         1.05
ven           0.93
asca          0.88
ventions      0.83
vention       0.77
offensive     0.76
ctr           0.73
xon           0.73
Activations Density: 0.015%