Explanations
actions or intentions that are followed by attempts to do something
actions related to attempts or efforts to do something
Negative Logits
Token      Logit
hander     -0.80
workers    -0.77
eyes       -0.76
houses     -0.75
spr        -0.75
era        -0.68
room       -0.68
lined      -0.68
father     -0.67
front      -0.67
Positive Logits
Token            Logit
unsuccessfully   1.12
resusc           0.83
suicide          0.79
URES             0.78
URE              0.77
llor             0.76
Attempts         0.74
emulate          0.74
relocation       0.71
ossibility       0.70
Activations Density: 0.041%