Explanations
phrases indicating causation or reasons for consequences
Negative Logits
Token     Logit
ewith     -0.15
idi       -0.15
usa       -0.14
akov      -0.14
idis      -0.14
haven     -0.13
Prefer    -0.13
ux        -0.13
auce      -0.13
ÙĨاء      -0.13
Positive Logits
Token      Logit
partially  0.35
partly     0.34
party      0.28
least      0.24
part       0.23
largely    0.22
Party      0.22
atleast    0.22
جزئ        0.21
Partial    0.21
Activations Density: 0.073%