VC-TVG

2023/04/03 Video

Overview

Language-Free (baseline):

ActivityNet Captions

| Metric | Value (%) |
| --- | --- |
| R@0.1 | 61.35 |
| R@0.3 | 47.61 |
| R@0.5 | 32.59 |
| R@0.7 | 15.42 |
| mIoU | 31.85 |

Charades-STA

| Metric | Value (%) |
| --- | --- |
| R@0.1 | - |
| R@0.3 | 52.95 |
| R@0.5 | 37.24 |
| R@0.7 | 19.33 |
| mIoU | 36.05 |

ActivityNet Captions

train_lf_20

attention_loss = 1.0

No ConvBNReLU

ResNet + LN (no activation after the LN)

Position Embedding(1D sin-cos PE)
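A minimal sketch of a 1D sin-cos positional embedding (the standard Transformer formulation; `sincos_pe_1d` is a hypothetical name):

```python
import math

import torch

def sincos_pe_1d(seq_len: int, dim: int) -> torch.Tensor:
    """Fixed 1D sine-cosine positional embedding of shape (seq_len, dim).

    Assumes dim is even: even channels get sin, odd channels get cos.
    """
    position = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)   # (seq_len, 1)
    div_term = torch.exp(torch.arange(0, dim, 2, dtype=torch.float32)
                         * (-math.log(10000.0) / dim))                   # (dim / 2,)
    pe = torch.zeros(seq_len, dim)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe
```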

BS = 512

Adam: 5e-4

cosin_lr: 5e-4 -> 3e-4
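One way to get this 5e-4 -> 3e-4 decay is `CosineAnnealingLR` with `eta_min`; whether the repo's `cosin_lr` is implemented this way is an assumption:

```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR

model = torch.nn.Linear(512, 512)  # stand-in for the real model
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
# Cosine decay from the initial 5e-4 down to eta_min=3e-4 over 300 epochs.
scheduler = CosineAnnealingLR(optimizer, T_max=300, eta_min=3e-4)

for epoch in range(300):
    # ... one training epoch ...
    scheduler.step()
```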

epoch: 300

Results (format: rounded baseline -> (this run), delta):

  • R@0.1 -> 61 (63.57), +2
  • R@0.3 -> 47 (41.12), -6
  • R@0.5 -> 32 (23.64), -9
  • R@0.7 -> 15 (6.38), -9
  • mIoU -> 32 (27.37), -5

train_lf_25

attention_loss = 1.0

No learning-rate decay

lr = 4e-4

$ReLU(LN(Conv1d(MSA(x)) + x))$
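A sketch of this block in PyTorch (the kernel size, head count, and `MSAConvBlock` name are assumptions):

```python
import torch
import torch.nn as nn

class MSAConvBlock(nn.Module):
    """Sketch of ReLU(LN(Conv1d(MSA(x)) + x))."""

    def __init__(self, dim: int = 512, n_heads: int = 4):
        super().__init__()
        self.msa = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.conv = nn.Conv1d(dim, dim, kernel_size=3, padding=1)
        self.ln = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, T, C)
        attn, _ = self.msa(x, x, x)
        y = self.conv(attn.transpose(1, 2)).transpose(1, 2)  # Conv1d over time
        return torch.relu(self.ln(y + x))
```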

Add Non-Local Block

epoch = 300

epoch 176:

  • R@0.1 -> 56.35; -5
  • R@0.3 -> 35.54; -12
  • R@0.5 -> 16.91; -16
  • R@0.7 -> 5.45; -10
  • mIoU -> 23.19; -9

train_lf_26

epoch: 300

On top of 25, the text_feats are normalized (norm(dim=-1)), and some Gaussian noise is added during training (as done in Language-Free).
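A sketch of that preprocessing; the noise scale `sigma` is a placeholder, since Language-Free tunes it:

```python
import torch
import torch.nn.functional as F

def perturb_text_feats(text_feats: torch.Tensor, training: bool,
                       sigma: float = 0.1) -> torch.Tensor:
    """L2-normalize along the last dim, then add Gaussian noise at train time."""
    text_feats = F.normalize(text_feats, dim=-1)  # norm(dim=-1)
    if training:
        text_feats = text_feats + sigma * torch.randn_like(text_feats)
    return text_feats
```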

Loss curve 【drops first, then rises, then converges】

epoch 221:

  • R@0.1 -> 61 (73.86), +12
  • R@0.3 -> 47 (42.97), -4
  • R@0.5 -> 32 (24.83), -7
  • R@0.7 -> 15 (11.94), -3
  • mIoU -> 32 (31.64), -1

train_lf_27

On top of 26, trained for 500 epochs, with the LR lowered from 4e-4 to 3e-4.

Result: no difference from 26; the loss still drops first, then rises, then converges.

train_lf_28

attention_loss = 1.0

Conv1d

Non Local Layer

MSA-LN = ${\color{red}ReLU(LN(MSA(x) + x))}$ ==Post-LN==

Position Embedding(1D sin-cos PE)

BS = 512

Adam: 4e-4

cosin_lr: 5e-4 -> 3e-4

epoch: 500

Add Gaussian noise

  • Post-LN may be better suited to this task
  • The Non-Local Layer may be critical for accuracy
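For reference, a minimal embedded-Gaussian non-local block over the temporal axis (after Wang et al., 2018); the in-repo layer may differ in details:

```python
import torch
import torch.nn as nn

class NonLocalBlock1D(nn.Module):
    """Embedded-Gaussian non-local block for (B, C, T) feature maps."""

    def __init__(self, dim: int = 512):
        super().__init__()
        inter = dim // 2
        self.theta = nn.Conv1d(dim, inter, 1)
        self.phi = nn.Conv1d(dim, inter, 1)
        self.g = nn.Conv1d(dim, inter, 1)
        self.out = nn.Conv1d(inter, dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, T)
        theta = self.theta(x).transpose(1, 2)          # (B, T, C/2)
        phi = self.phi(x)                              # (B, C/2, T)
        attn = torch.softmax(theta @ phi, dim=-1)      # (B, T, T) pairwise weights
        g = self.g(x).transpose(1, 2)                  # (B, T, C/2)
        y = (attn @ g).transpose(1, 2)                 # (B, C/2, T)
        return x + self.out(y)                         # residual connection
```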

epoch 403:

  • R@0.1 -> 59.17; -2
  • R@0.3 -> 38.38; -9
  • R@0.5 -> 16.71; -16
  • R@0.7 -> 5.46; -10
  • mIoU -> 24.41; -8

train_lf_29

attention_loss = 1.0

Conv1d

Non Local Layer

MSA-LN = ${\color{red}ReLU(x + MSA(LN(x)))}$ ==Pre-LN==
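For clarity, the two normalization placements compared in 28 and 29 in one sketch (the module name and `pre_ln` flag are hypothetical):

```python
import torch
import torch.nn as nn

class MSALayer(nn.Module):
    """Post-LN: ReLU(LN(MSA(x) + x));  Pre-LN: ReLU(x + MSA(LN(x)))."""

    def __init__(self, dim: int = 512, n_heads: int = 4, pre_ln: bool = False):
        super().__init__()
        self.msa = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.ln = nn.LayerNorm(dim)
        self.pre_ln = pre_ln

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, T, C)
        if self.pre_ln:
            h = self.ln(x)
            return torch.relu(x + self.msa(h, h, h)[0])   # Pre-LN (run 29)
        return torch.relu(self.ln(self.msa(x, x, x)[0] + x))  # Post-LN (run 28)
```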

Position Embedding(1D sin-cos PE)

BS = 512

Adam: 4e-4

cosin_lr: 5e-4 -> 3e-4

epoch: 500

Add Gaussian noise

epoch 336:

  • R@0.1 -> 29.16
  • R@0.3 -> 18.5
  • R@0.5 -> 10.05
  • R@0.7 -> 4.85
  • mIoU -> 13.52

train_lf_30

On top of 29, only R-Drop Loss was added (alpha=0.7, computed only on the final localization result).

R-Drop Loss may bring some improvement on this task.
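A sketch of the R-Drop term: two forward passes through the same model (dropout makes them differ), then a symmetric KL between the two localization outputs. Treating the localization output as logits over positions is an assumption:

```python
import torch
import torch.nn.functional as F

def r_drop_loss(logits1: torch.Tensor, logits2: torch.Tensor,
                alpha: float = 0.7) -> torch.Tensor:
    """Symmetric KL between two dropout-perturbed passes, scaled by alpha."""
    p = F.log_softmax(logits1, dim=-1)
    q = F.log_softmax(logits2, dim=-1)
    kl_pq = F.kl_div(p, q, reduction="batchmean", log_target=True)
    kl_qp = F.kl_div(q, p, reduction="batchmean", log_target=True)
    return alpha * 0.5 * (kl_pq + kl_qp)

# usage: out1, out2 = model(batch), model(batch)   # two stochastic passes
#        loss = task_loss + r_drop_loss(out1, out2, alpha=0.7)
```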

epoch 414:

  • R@0.1 -> 36.09
  • R@0.3 -> 21.34
  • R@0.5 -> 11.22
  • R@0.7 -> 5.466
  • mIoU -> 15.24

train_lf_31

Combine Post-LN, the Non-Local Layer, and R-Drop Loss (alpha=0.7, loss computed only on localization):

attention_loss = 1.0

Conv1d

Non Local Layer

MSA-LN = ${\color{red}ReLU(LN(MSA(x) + x))}$ ==Post-LN==

Position Embedding(1D sin-cos PE)

BS = 128 ==not enough free GPU memory on the lab server==

Adam: 4e-4

cosin_lr: 5e-4 -> 3e-4

epoch: 500

Add Gaussian noise

ing ->

【795887】

train_lf_32

Same as 31, but with BS changed to 512.

ing -> train_lf_bad

【3132347】


BS = 256

ing ->

【2211727】

【The final version of the code stops here】

train_lf_33

train_lf_33_bad -> LR=4e-4, weight_decay=0.01, BS=512【Shutdown】

Switch the optimizer to AdamW:

  • LR: 1e-4
  • weight decay: 1e-4

BS = 512
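As a sketch, the optimizer change amounts to the line below; the point of AdamW is that it decouples weight decay from the gradient update, unlike Adam's L2 penalty:

```python
import torch

model = torch.nn.Linear(512, 512)  # stand-in for the real model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-4)
```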

ing ->

【2180289】

train_lf_34

MSA-LN = ${\color{red}ReLU(x + MSA(LN(x)))}$ ==Pre-LN==

【3979994】

ing -> end

【1614056】

  • CLIP4Caption
    • hidden_dim = 512
    • n_heads = 4
    • n_layers = 3
  • CLIP
    • ViT-B/32
    • hidden_dim = 512
  • MSA-LN: $x + MSA(LN(x))$ ==Pre-LN==
  • x = x + MSA(x) ==ResNet==
  • BS = 512
  • epoch = 300
  • Adam
    • lr = 4e-4
  • ActivityNet Captions
  • No Warmup
  • No Non Local Block

(train_lf_34.log)

ing -> bad end

Vanishing gradients.

train_lf_35

  • Script: train_lf.py
  • Model: TrainModel in lf_train_module.py
| Experiment \ Params | bs & epoch | Attention | Norm | Non Local | Status | Result |
| --- | --- | --- | --- | --- | --- | --- |
| tmp_1 |  |  |  |  |  |  |
| tmp_2 |  |  |  |  |  |  |
| tmp_3 |  |  |  |  |  |  |

tmp_1

【4046783】

  • bs = 256
  • epoch = 300
  • Adam
    • LR: 4e-4
    • Warmup: 3; 1e-8 -> 4e-4
    • Cosine LR scheduler: 4e-4 -> 1e-8
  • Video Aware Cross-Attention (see the sketch after this list)
    • Query = Video
    • Key, Value = Text
    • Pre-Norm
    • No FFN
  • Self-Attention
    • Query, Key, Value = Cross-Attended Result
    • Pre-Norm
    • No FFN
  • Non Local: Yes
  • No Gaussian Noise
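A sketch of the fusion described above: cross-attention with video as query and text as key/value, followed by self-attention, both Pre-Norm and without FFN (`FusionEncoder` and the head count are assumptions):

```python
import torch
import torch.nn as nn

class FusionEncoder(nn.Module):
    """tmp_1-style fusion: video-aware cross-attention, then self-attention."""

    def __init__(self, dim: int = 512, n_heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.ln_v = nn.LayerNorm(dim)
        self.ln_t = nn.LayerNorm(dim)
        self.ln_s = nn.LayerNorm(dim)

    def forward(self, video: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # Cross-attention: video queries attend to text keys/values (Pre-Norm).
        v, t = self.ln_v(video), self.ln_t(text)
        x = video + self.cross_attn(v, t, t)[0]
        # Self-attention over the cross-attended result (Pre-Norm, no FFN).
        h = self.ln_s(x)
        return x + self.self_attn(h, h, h)[0]
```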

ing ->

(train_lf_35_1.log)

tmp_2

【2692732】

  • BS = 256
  • Epoch = 300
  • Adam
    • lr = 4e-4
    • Warmup: 3; 1e-8 -> 4e-4
    • No LR scheduler
  • Cross Attention
    • Query: Video
    • Key, Value: Text
    • Post-Norm
    • No FFN
  • Self Attention
    • Query, Key, Value: Cross-Attention Result
    • Post-Norm
    • No FFN
  • Non Local: Yes

ing -> bad end ==vanishing gradients==

(train_lf_35_2.log)

tmp_3

【3939145】

  • BS = 256
  • Epoch = 300
  • Adam
    • lr = 4e-4
    • Warmup: 3; 1e-8 -> 4e-4
    • Cosine LR scheduler: 4e-4 -> 1e-8
  • Cross Attention
    • Query: Video
    • Key, Value: Text
    • Post-Norm
    • No FFN
  • Self Attention
    • Query, Key, Value: Cross-Attention Result
    • Post-Norm
    • No FFN
  • Non Local: Yes
  • Add Gaussian Noise

ing -> bad end

(train_lf_35_3.log)

tmp_4

【3079209】

  • BS = 256
  • Epoch = 300
  • Adam
    • lr = 4e-4
    • No Warmup
    • No LR scheduler
  • Cross Attention
    • Query: Video
    • Key, Value: Text
    • No Norm
    • No FFN
  • Self Attention
    • Query, Key, Value: Cross-Attention Result
    • No Norm
    • No FFN
  • Non Local: Yes
  • Add Gaussian Noise

ing -> bad end

(train_lf_35_4.log)

```
Traceback (most recent call last):
  File "/root/code/VC-TVG/train_lf.py", line 592, in <module>
    train_lf_activitynet()
  File "/root/code/VC-TVG/train_lf.py", line 203, in train_lf_activitynet
    sum_loss.backward()
  File "/opt/conda/lib/python3.10/site-packages/torch/_tensor.py", line 488, in backward
    torch.autograd.backward(
  File "/opt/conda/lib/python3.10/site-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: Function 'LogBackward0' returned nan values in its 0th output.
```
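This message comes from autograd anomaly detection: some `torch.log` in the loss received a value at or below zero. The actual loss code isn't shown here, so this is an assumption, but the usual guard is clamping before the log:

```python
import torch

def safe_log(x: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Clamp the input so the log (and its backward) stays finite."""
    return torch.log(x.clamp_min(eps))
```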

tmp_5

【】

  • Regression Head: (512, 512) + ReLU() + (512, 2) + ReLU() (see the sketch after this list)
  • BS = 256 -> 512
  • Epoch = 300
  • Adam
    • lr:
      • 4e-4 changed to 1e-4
      • 2e-4
    • Warmup: 3; 1e-8 -> 4e-4 / 1e-8 -> 1e-4
    • LR scheduler: 4e-4 -> 1e-8 / 1e-4 -> 1e-8
  • AttentionPooling
    • Use the ReLU() activation in place of tanh()
    • Disable use_embedding, i.e. use_embedding=False
  • Cross Attention
    • Query: Video
    • Key, Value: Text
    • Post-Norm
    • Add ReLU() activation
    • No FFN

      attended + residual

  • Self Attention
    • Query, Key, Value: Cross-Attention Result
    • Post-Norm
    • Add ReLU() activation
    • No FFN

      attended + residual

  • Train
    • transformer_text_feats: 768 -> 512
  • Non Local: Yes
  • Add Gaussian Noise
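The regression head from the list above, as a sketch; the trailing ReLU() keeps the predicted (start, end) pair non-negative:

```python
import torch.nn as nn

# (512, 512) + ReLU() + (512, 2) + ReLU()
regression_head = nn.Sequential(
    nn.Linear(512, 512),
    nn.ReLU(),
    nn.Linear(512, 2),
    nn.ReLU(),
)
```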

Not started yet.

(train_lf_35_5.log)

tmp_6

【】

Change the fusion encoder's hidden dimension from 512 to 256.

  • Regression Head: (256, 256) + ReLU() + (256, 2) + ReLU()

  • BS = 256
  • Epoch = 300
  • Adam
    • lr = 4e-4
    • Warmup: 3; 1e-8 -> 4e-4
    • LR scheduler: 4e-4 -> 1e-8
  • Cross Attention
    • Query: Video
    • Key, Value: Text
    • Post-Norm
    • No FFN
  • Self Attention
    • Query, Key, Value: Cross-Attention Result
    • Post-Norm
    • No FFN
  • Train
    • transformer_text_feats: 768 -> 512
  • text_mlp: 512 -> 256
  • video_mlp: 256 -> 256
  • Non Local: Yes
  • Add Gaussian Noise

ing -> bad end ==vanishing gradients==

(train_lf_35_6.log)

Charades-STA

train_lf_charades_34

【4098718】

  • BS = 512
  • Epoch = 500
  • Adam
    • lr = 4e-4
  • Cross Attention
    • Query: Video
    • Key, Value: Text
    • Post-Norm
    • No FFN
  • Self Attention
    • Query, Key, Value: Cross-Attention Result
    • Post-Norm
    • No FFN
  • Non Local: Yes

(train_lf_charades_34.log)

ing ->
