INDEX
Explanations
expressions related to the understanding or fairness of a situation
expressions of justification or reasonableness
Negative Logits
bowling      -0.77
downed       -0.67
worm         -0.65
craft        -0.61
virginity    -0.61
ngth         -0.60
bows         -0.59
infect       -0.59
Dur          -0.59
stars        -0.58
Positive Logits
assume       0.72
inference    0.71
ensibly      0.69
assumption   0.68
ATOR         0.68
conclude     0.67
allery       0.67
infer        0.66
Luxem        0.66
guiActive    0.66
Activations Density: 0.139%