Explanations
references to benefits or advantageous outcomes
Negative Logits
adoo      -0.17
adox      -0.15
ivas      -0.14
onga      -0.14
oupper    -0.14
minate    -0.14
quito     -0.14
rouch     -0.14
quiv      -0.14
иму       -0.14
Positive Logits
actors    0.34
actor     0.34
itted     0.30
action    0.23
actions   0.23
icia      0.23
acting    0.22
eci       0.20
actors    0.20
Actors    0.20
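The positive and negative logit lists are the two ends of one ranking over the vocabulary. The sketch below shows that selection step, assuming a hypothetical input `logit_effects` that maps each token string to the feature's effect on that token's logit. The dashboard does not say how its values were computed, so treat the input as an assumption. Tokens that differ only by a leading space (for example " actors" vs "actors") are distinct vocabulary entries. This likely explains why "actors" appears twice in the positive list.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, float]


def top_and_bottom_logits(
    logit_effects: Dict[str, float], k: int = 10
) -> Tuple[List[Pair], List[Pair]]:
    """Return the k most positive and k most negative (token, effect) pairs.

    `logit_effects` is a hypothetical mapping from token string to the
    feature's effect on that token's logit (an assumption, not the
    dashboard's documented method).
    """
    items = list(logit_effects.items())
    # Most positive first; Python's sort stays stable with reverse=True,
    # so tied tokens keep their input order.
    positive = sorted(items, key=lambda kv: kv[1], reverse=True)[:k]
    # Most negative first.
    negative = sorted(items, key=lambda kv: kv[1])[:k]
    return positive, negative


if __name__ == "__main__":
    # A few values copied from the lists above, used as sample input.
    effects = {" actors": 0.34, "actor": 0.34, "itted": 0.30,
               "adoo": -0.17, "adox": -0.15, "ivas": -0.14}
    pos, neg = top_and_bottom_logits(effects, k=2)
    print(pos)
    print(neg)
```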
Activation Density: 0.008%
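Activation density is the fraction of tokens on which the feature fires. A density of 0.008% means roughly 8 active tokens per 100,000. The sketch below assumes "active" means an activation strictly above a threshold of 0. That criterion is an assumption, because the dashboard does not state its exact definition.

```python
from typing import Sequence


def activation_density(activations: Sequence[float], threshold: float = 0.0) -> float:
    """Fraction of tokens whose activation exceeds `threshold`.

    The strict `> threshold` test is an assumed definition of "active".
    """
    if len(activations) == 0:
        raise ValueError("need at least one activation")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


if __name__ == "__main__":
    # 8 active tokens out of 100,000 gives a density of 0.008%.
    acts = [0.0] * 99_992 + [1.0] * 8
    print(f"{activation_density(acts) * 100:.3f}%")
```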