INDEX
Explanations
references to social or cultural commentary
Negative Logits
meanwhile    -0.16
however      -0.16
,            -0.16
mixed        -0.15
athe         -0.14
i            -0.14
alth         -0.14
isan         -0.14
idal         -0.14
wert         -0.14
Positive Logits
ailability       0.16
782              0.15
forge            0.15
REFIX            0.14
AxisAlignment    0.14
ụn               0.14
Ä©               0.14
Forge            0.13
versa            0.13
llib             0.13
Activations Density 0.135%