INDEX
Explanations
phrases that contradict a preceding statement or expectation
negations and phrases indicating what something is not
Negative Logits
ãģ®ç      -0.73
ngth      -0.69
çīĪ       -0.67
ividual   -0.67
icks      -0.64
éĥ        -0.62
showc     -0.62
Links     -0.61
wives     -0.60
agos      -0.60
Positive Logits
necessarily   1.08
happening     0.91
rael          0.82
exactly       0.80
icable        0.78
uncommon      0.78
actly         0.77
coincidence   0.76
happ          0.74
detract       0.73
Activations Density: 0.107%