INDEX
Explanations
contradictions and nuances in arguments
Negative Logits
not       -0.21
不        -0.20
NOT       -0.20
nicht     -0.19
không     -0.19
not       -0.18
niet      -0.17
не        -0.17
icz       -0.17
unch      -0.16
Positive Logits
rather    0.44
Rather    0.41
Rather    0.39
rather    0.38
instead   0.35
Instead   0.33
Instead   0.33
naopak    0.32
sondern   0.32
instead   0.28
Activations Density 0.290%