INDEX
Explanations
specifies actions and their consequences
Negative Logits
set     -1.44
mis     -1.39
Re      -1.34
had     -1.30
took    -1.27
made    -1.27
le      -1.26
Her     -1.25
h       -1.25
si      -1.24
Positive Logits
всички    1.83
tunik     1.72
bluz      1.70
OGSÅ      1.70
superbes  1.70
mainly    1.62
karier    1.61
cewek     1.60
incrí     1.59
Mainly    1.56
Activations Density 0.009%
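
To put the density figure in concrete terms, here is a minimal sketch. It assumes the density is the fraction of sampled tokens on which the feature has a nonzero activation. The function name, the threshold, and the sample data are hypothetical and are not taken from the dashboard.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of positions whose activation exceeds `threshold`.

    Assumption: "density" means the share of sampled tokens on which the
    feature fires at all. The dashboard itself does not state its definition.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: the feature fires on 9 of 100,000 tokens.
sample = [0.0] * 99_991 + [1.5] * 9
density = activation_density(sample)
print(f"{density:.3%}")  # prints 0.009%
```

Under that assumption, a density of 0.009% corresponds to roughly 9 active tokens per 100,000 sampled, which would make this a very sparse feature.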