INDEX
Explanations
references to social and cultural phenomena
Negative Logits
Token        Logit
overall      -0.20
969          -0.15
overall      -0.15
initially    -0.15
yani         -0.14
initial      -0.14
özellikle    -0.14
907          -0.14
adulte       -0.14
ertas        -0.14
Positive Logits
Token        Logit
instead      0.19
while        0.19
while        0.19
instead      0.17
WHILE        0.17
uzzi         0.16
_while       0.15
ecause       0.15
whilst       0.15
because      0.15
Activations Density 0.726%