INDEX
Explanations
references to mistreatment or abuse in relation to visa status
Negative Logits
('      -0.32
–       -0.32
("      -0.29
(~      -0.27
(       -0.26
(«      -0.25
‘       -0.23
'       -0.23
â̦"     -0.23
(&      -0.22
Positive Logits
--↵       0.31
----↵     0.30
....↵     0.27
-----↵    0.24
....      0.23
......    0.23
....↵↵    0.22
.....↵↵   0.22
---↵      0.21
—↵↵       0.20
Activations Density 0.005%