Explanations
conjunctions that indicate contrast or opposition
Negative Logits
erno      -0.15
nicht     -0.15
essen     -0.14
Bold      -0.14
esson     -0.14
नह        -0.14
adiens    -0.14
erset     -0.14
не        -0.14
không     -0.14
Positive Logits
indeed    0.20
rather    0.19
лÑİ       0.16
htub      0.16
legg      0.16
Rather    0.15
Rather    0.15
ÃĹ↵↵      0.15
ingleton  0.14
mo        0.14
Activations Density 0.050%