INDEX
Negative Logits
Tempor          0.38
Tempor          0.37
Genuine         0.36
lens            0.34
disinformation  0.34
undergraduate   0.34
coworker        0.34
andus           0.34
waxes           0.34
distrust        0.33
Positive Logits
break           0.69
snacks          0.64
served          0.64
🍴              0.63
ब्रेक            0.62
🍽              0.62
eaten           0.61
Break           0.59
🍱              0.59
overlooking     0.58
Activations Density: 0.024%