From e60957cb30d9568938068948c61b29b1b2d07b87 Mon Sep 17 00:00:00 2001 From: "mingjiang.li" Date: Mon, 10 Mar 2025 15:02:53 +0800 Subject: [PATCH 1/9] unify model readme format - cv/pose --- cv/pose/alphapose/pytorch/README.md | 67 ++++++++------- cv/pose/hrnet/paddlepaddle/README.md | 59 +++++++------ cv/pose/hrnet/pytorch/README.md | 47 +++++----- cv/pose/openpose/mindspore/README.md | 85 ++++++++++--------- .../mae/pytorch/README.md | 51 ++++++----- 5 files changed, 169 insertions(+), 140 deletions(-) diff --git a/cv/pose/alphapose/pytorch/README.md b/cv/pose/alphapose/pytorch/README.md index a85b2257f..20ebc61f8 100755 --- a/cv/pose/alphapose/pytorch/README.md +++ b/cv/pose/alphapose/pytorch/README.md @@ -1,34 +1,22 @@ # AlphaPose -## Model description +## Model Description -AlphaPose is an accurate multi-person pose estimator, which is the first open-source system that achieves 70+ mAP (75 mAP) on COCO dataset and 80+ mAP (82.1 mAP) on MPII dataset. To match poses that correspond to the same person across frames, we also provide an efficient online pose tracker called Pose Flow. It is the first open-source online pose tracker that achieves both 60+ mAP (66.5 mAP) and 50+ MOTA (58.3 MOTA) on PoseTrack Challenge dataset. +AlphaPose is an accurate multi-person pose estimator, which is the first open-source system that achieves 70+ mAP (75 +mAP) on COCO dataset and 80+ mAP (82.1 mAP) on MPII dataset. To match poses that correspond to the same person across +frames, we also provide an efficient online pose tracker called Pose Flow. It is the first open-source online pose +tracker that achieves both 60+ mAP (66.5 mAP) and 50+ MOTA (58.3 MOTA) on PoseTrack Challenge dataset. -## Step 1: Installation +## Model Preparation -```bash -# install libGL -yum install mesa-libGL +### Prepare Resources -# install zlib -wget http://www.zlib.net/fossils/zlib-1.2.9.tar.gz -tar xvf zlib-1.2.9.tar.gz -cd zlib-1.2.9/ -./configure && make install -cd .. -rm -rf zlib-1.2.9.tar.gz zlib-1.2.9/ +Go to visit [COCO official website](https://cocodataset.org/#download), then select the COCO dataset you want to +download. -# install requirements -pip3 install seaborn pandas pycocotools matplotlib -pip3 install easydict tensorboardX opencv-python -``` - -## Step 2: Preparing datasets - -Go to visit [COCO official website](https://cocodataset.org/#download), then select the COCO dataset you want to download. - -Take coco2017 dataset as an example, specify `/path/to/coco2017` to your COCO path in later training process, the unzipped dataset path structure sholud look like: +Take coco2017 dataset as an example, specify `/path/to/coco2017` to your COCO path in later training process, the +unzipped dataset path structure sholud look like: ```bash coco2017 @@ -49,7 +37,26 @@ coco2017 └── ... ``` -## Step 3: Training +### Install Dependencies + +```bash +# install libGL +yum install mesa-libGL + +# install zlib +wget http://www.zlib.net/fossils/zlib-1.2.9.tar.gz +tar xvf zlib-1.2.9.tar.gz +cd zlib-1.2.9/ +./configure && make install +cd .. 
+rm -rf zlib-1.2.9.tar.gz zlib-1.2.9/ + +# install requirements +pip3 install seaborn pandas pycocotools matplotlib +pip3 install easydict tensorboardX opencv-python +``` + +## Model Training ```bash # create soft link to coco @@ -60,10 +67,12 @@ ln -s /path/to/coco2017 /home/datasets/cv/coco bash ./scripts/trainval/train.sh ./configs/coco/resnet/256x192_res50_lr1e-3_1x.yaml 1 ``` -## Results -| GPUs | FPS | ACC | -| ---- | ---- | ---- | -| BI-V100 x8 | 1.71s/it | acc: 0.8429 | +## Model Results + +| Model | GPU | FPS | ACC | +|-----------|------------|----------|-------------| +| AlphaPose | BI-V100 x8 | 1.71s/it | acc: 0.8429 | + +## References -## Reference - [AlphaPose](https://github.com/MVIG-SJTU/AlphaPose) diff --git a/cv/pose/hrnet/paddlepaddle/README.md b/cv/pose/hrnet/paddlepaddle/README.md index 55bffd1eb..5bb76c392 100644 --- a/cv/pose/hrnet/paddlepaddle/README.md +++ b/cv/pose/hrnet/paddlepaddle/README.md @@ -1,26 +1,24 @@ # HRNet-W32 -## Model description -HRNet (High-Resolution Net) is proposed for the task of 2D human pose estimation (Human Pose Estimation or Keypoint Detection), and the network is mainly aimed at the pose assessment of a single individual. Most existing human pose estimation methods recover high-resolution representations from low-resolution representations produced by a high-to-low resolution network. Instead, HRNet maintains high-resolution representations through the whole process. HRNet starts from a high-resolution subnetwork as the first stage, gradually add high-to-low resolution subnetworks one by one to form more stages, and connect the mutli-resolution subnetworks in parallel. Then, HRNet conducts repeated multi-scale fusions such that each of the high-to-low resolution representations receives information from other parallel representations over and over, leading to rich high-resolution representations. As a result, the predicted keypoint heatmap is potentially more accurate and spatially more precise. +## Model Description +HRNet, or High-Resolution Net, is a general purpose convolutional neural network for tasks like semantic segmentation, +object detection and image classification. It is able to maintain high resolution representations through the whole +process. We start from a high-resolution convolution stream, gradually add high-to-low resolution convolution streams +one by one, and connect the multi-resolution streams in parallel. The resulting network consists of several stages and +the nth stage contains n streams corresponding to n resolutions. The authors conduct repeated multi-resolution fusions +by exchanging the information across the parallel streams over and over. -## Step 1: Installation -``` -git clone https://github.com/PaddlePaddle/PaddleDetection.git -``` +## Model Preparation -``` -cd PaddleDetection -pip3 install numba==0.56.4 -pip3 install -r requirements.txt -python3 setup.py install -``` +### Prepare Resources -## Step 2: Preparing Datasets -``` +```bash python3 dataset/coco/download_coco.py ``` + The coco2017 dataset path structure should look like: + ```bash coco2017 ├── annotations @@ -39,31 +37,40 @@ coco2017 ├── val2017.txt └── ... 
``` -## Step 3: Training -single GPU +### Install Dependencies + +```bash +git clone https://github.com/PaddlePaddle/PaddleDetection.git + +cd PaddleDetection +pip3 install numba==0.56.4 +pip3 install -r requirements.txt +python3 setup.py install ``` + +## Model Training + +```bash +# single GPU export CUDA_VISIBLE_DEVICES=0 export FLAGS_cudnn_exhaustive_search=True export FLAGS_cudnn_batchnorm_spatial_persistent=True python3 tools/train.py -c configs/keypoint/hrnet/hrnet_w32_384x288.yml --eval -o use_gpu=true -``` -8 GPU(Distributed Training) -``` +# 8 GPU(Distributed Training) export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 export FLAGS_cudnn_exhaustive_search=True export FLAGS_cudnn_batchnorm_spatial_persistent=True python3 -m paddle.distributed.launch --log_dir=./log_hrnet_w32_384x288/ --gpus 0,1,2,3,4,5,6,7 tools/train.py -c configs/keypoint/hrnet/hrnet_w32_384x288.yml --eval ``` -## Results on BI-V100 -
-
-| GPU         | FP32                                           |
-| ----------- | ---------------------------------------------- |
-| 8 cards     | BatchSize=64,AP(coco val)=78.4, single GPU FPS= 45 |
+## Model Results
+
+| Model     | GPU        | FP32                                               |
+|-----------|------------|----------------------------------------------------|
+| HRNet-W32 | BI-V100 x8 | BatchSize=64, AP(coco val)=78.4, single GPU FPS=45 |
 
-
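+The sketch below is only meant to make the "parallel multi-resolution streams with repeated fusion" idea from the
+model description above easier to picture. It is a minimal PyTorch-style illustration, not the PaddleDetection
+implementation trained in this README, and every class name and tensor shape in it is a made-up example.
+
+```python
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+
+
+class TwoBranchFusion(nn.Module):
+    """Toy HRNet-style block: a high-resolution and a low-resolution branch that exchange information."""
+
+    def __init__(self, channels_high=32, channels_low=64):
+        super().__init__()
+        self.high = nn.Conv2d(channels_high, channels_high, 3, padding=1)   # keeps full resolution
+        self.low = nn.Conv2d(channels_low, channels_low, 3, padding=1)      # runs at half resolution
+        self.high_to_low = nn.Conv2d(channels_high, channels_low, 3, stride=2, padding=1)
+        self.low_to_high = nn.Conv2d(channels_low, channels_high, 1)
+
+    def forward(self, x_high, x_low):
+        h = F.relu(self.high(x_high))
+        l = F.relu(self.low(x_low))
+        # repeated multi-scale fusion: each branch receives the other branch's features
+        h_fused = h + F.interpolate(self.low_to_high(l), size=h.shape[-2:], mode="bilinear", align_corners=False)
+        l_fused = l + self.high_to_low(h)
+        return h_fused, l_fused
+
+
+h, l = TwoBranchFusion()(torch.randn(1, 32, 96, 72), torch.randn(1, 64, 48, 36))
+print(h.shape, l.shape)  # torch.Size([1, 32, 96, 72]) torch.Size([1, 64, 48, 36])
+```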
+## References -## Reference - [PaddleDetection](https://github.com/PaddlePaddle/PaddleDetection) diff --git a/cv/pose/hrnet/pytorch/README.md b/cv/pose/hrnet/pytorch/README.md index d5b86de4e..02a1f0b93 100644 --- a/cv/pose/hrnet/pytorch/README.md +++ b/cv/pose/hrnet/pytorch/README.md @@ -1,16 +1,17 @@ # HRNet -## Model description +## Model Description -HRNet, or High-Resolution Net, is a general purpose convolutional neural network for tasks like semantic segmentation, object detection and image classification. It is able to maintain high resolution representations through the whole process. We start from a high-resolution convolution stream, gradually add high-to-low resolution convolution streams one by one, and connect the multi-resolution streams in parallel. The resulting network consists of several stages and the nth stage contains n streams corresponding to n resolutions. The authors conduct repeated multi-resolution fusions by exchanging the information across the parallel streams over and over. +HRNet, or High-Resolution Net, is a general purpose convolutional neural network for tasks like semantic segmentation, +object detection and image classification. It is able to maintain high resolution representations through the whole +process. We start from a high-resolution convolution stream, gradually add high-to-low resolution convolution streams +one by one, and connect the multi-resolution streams in parallel. The resulting network consists of several stages and +the nth stage contains n streams corresponding to n resolutions. The authors conduct repeated multi-resolution fusions +by exchanging the information across the parallel streams over and over. -## Step 1: Installing packages +## Model Preparation -```shell -pip3 install -r requirements.txt -``` - -## Step 2: Preparing datasets +### Prepare Resources Go to visit [COCO official website](https://cocodataset.org/#download), then select the COCO dataset you want to download. @@ -35,35 +36,31 @@ coco2017 └── ... 
``` -## Step 3: Training - -### On single GPU +### Install Dependencies ```shell -export COCO_DATASET_PATH=/path/to/coco2017 +pip3 install -r requirements.txt ``` +## Model Training + ```shell -python3 ./tools/train.py --cfg ./configs/coco/w32_512_adam_lr1e-3.yaml --datadir=${COCO_DATASET_PATH} --max_epochs=2 -``` +export COCO_DATASET_PATH=/path/to/coco2017 -### On single GPU (AMP) +# On single GPU +python3 ./tools/train.py --cfg ./configs/coco/w32_512_adam_lr1e-3.yaml --datadir=${COCO_DATASET_PATH} --max_epochs=2 -```shell +# On single GPU (AMP) python3 ./tools/train.py --cfg ./configs/coco/w32_512_adam_lr1e-3.yaml --datadir=${COCO_DATASET_PATH} --max_epochs=2 --amp -``` -### Multiple GPUs on one machine +# Multiple GPUs on one machine -```shell python3 ./tools/train.py --cfg ./configs/coco/w32_512_adam_lr1e-3.yaml --datadir=${COCO_DATASET_PATH} --max_epochs=2 --dist -``` - -### Multiple GPUs on one machine (AMP) -```shell +# Multiple GPUs on one machine (AMP) python3 ./tools/train.py --cfg ./configs/coco/w32_512_adam_lr1e-3.yaml --datadir=${COCO_DATASET_PATH} --max_epochs=2 --amp --dist ``` -## Reference -https://github.com/HRNet/HigherHRNet-Human-Pose-Estimation +## References + +- [HigherHRNet-Human-Pose-Estimation](https://github.com/HRNet/HigherHRNet-Human-Pose-Estimation) diff --git a/cv/pose/openpose/mindspore/README.md b/cv/pose/openpose/mindspore/README.md index 81bb4d2eb..ed0616724 100644 --- a/cv/pose/openpose/mindspore/README.md +++ b/cv/pose/openpose/mindspore/README.md @@ -1,29 +1,22 @@ -# Openpose +# OpenPose -## Model description +## Model Description -Openpose network proposes a bottom-up human attitude estimation algorithm using Part Affinity Fields (PAFs). Instead of a top-down algorithm: Detect people first and then return key-points and skeleton. The advantage of openpose is that the computing time does not increase significantly as the number of people in the image increases.However,the top-down algorithm is based on the detection result, and the runtimes grow linearly with the number of people. +OpenPose is a real-time multi-person 2D pose estimation model that uses a bottom-up approach with Part Affinity Fields +(PAFs) to detect human body keypoints and their connections. Unlike top-down methods, OpenPose's computational +efficiency remains stable regardless of the number of people in an image. It simultaneously detects body parts and +associates them to individuals, making it particularly effective for scenarios with multiple people, such as crowd +analysis and human-computer interaction applications. -[Paper](https://arxiv.org/abs/1611.08050): Zhe Cao,Tomas Simon,Shih-En Wei,Yaser Sheikh,"Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields",The IEEE Conference on Computer Vision and Pattern Recongnition(CVPR),2017 +## Model Preparation -## Step 1:Installation +### Prepare Resources -``` -# Pip the requirements -pip3 install -r requirements.txt -wget https://download.open-mpi.org/release/open-mpi/v4.0/openmpi-4.0.7.tar.gz -tar xf openmpi-4.0.7.tar.gz -cd openmpi-4.0.7/ -./configure --prefix=/usr/local/bin --with-orte -make -j4 && make install -export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib/ -``` - -## Step 2:Preparing datasets - -* Go to visit [COCO official website](https://gitee.com/link?target=https%3A%2F%2Fcocodataset.org%2F%23download), then select the COCO dataset you want to download. 
Take coco2017 dataset as an example, specify `/path/to/coco2017` to your COCO path in later training process, the unzipped dataset path structure sholud look like: +- Go to visit [COCO official website](https://gitee.com/link?target=https%3A%2F%2Fcocodataset.org%2F%23download), then + select the COCO dataset you want to download. Take coco2017 dataset as an example, specify `/path/to/coco2017` to your + COCO path in later training process, the unzipped dataset path structure sholud look like: - ``` +```bash coco2017 ├── annotations │ ├── instances_train2017.json @@ -40,15 +33,17 @@ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib/ ├── train2017.txt ├── val2017.txt └── ... - ``` -* Create the mask dataset. Run python gen_ignore_mask.py +``` + +- Create the mask dataset. Run python gen_ignore_mask.py + +```bash +python3 ./src/gen_ignore_mask.py --train_ann ./coco2017/annotations/person_keypoints_train2017.json --val_ann ./coco2017/annotations/person_keypoints_val2017.json --train_dir ./coco2017/train2017 --val_dir ./coco2017/val2017 +``` - ``` - python3 ./src/gen_ignore_mask.py --train_ann ./coco2017/annotations/person_keypoints_train2017.json --val_ann ./coco2017/annotations/person_keypoints_val2017.json --train_dir ./coco2017/train2017 --val_dir ./coco2017/val2017 - ``` -* The dataset folder is generated in the root directory and contains the following files: +- The dataset folder is generated in the root directory and contains the following files: - ``` +```bash ├── coco2017 ├── annotations ├─ person_keypoints_train2017.json @@ -58,22 +53,36 @@ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib/ ├─ train2017 ├─ val2017 └─ ... - ``` -* Download the VGG19 model of the MindSpore version: +``` + +- Download the VGG19 model of the MindSpore version: - [vgg19-0-97_5004.ckpt](https://download.mindspore.cn/model_zoo/converted_pretrained/vgg/vgg19-0-97_5004.ckpt) +[vgg19-0-97_5004.ckpt](https://download.mindspore.cn/model_zoo/converted_pretrained/vgg/vgg19-0-97_5004.ckpt) -## Step 3:Training +### Install Dependencies + +```bash +# Pip the requirements +pip3 install -r requirements.txt +wget https://download.open-mpi.org/release/open-mpi/v4.0/openmpi-4.0.7.tar.gz +tar xf openmpi-4.0.7.tar.gz +cd openmpi-4.0.7/ +./configure --prefix=/usr/local/bin --with-orte +make -j4 && make install +export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib/ +``` + +## Model Training Change the absolute path of the data in running shell `train_openpose_coco2017_1card.sh` `train_openpose_coco2017_8card.sh`. 
For example in `train_openpose_coco2017_1card.sh`: -``` +```bash bash scripts/run_standalone_train.sh /home/coco2017/train2017 /home/coco2017/annotations/person_keypoints_train2017.json /home/coco2017/ignore_mask_train /home/vgg19-0-97_5004.ckpt ``` -``` +```bash # Run on 1 GPU bash train_openpose_coco2017_1card.sh @@ -84,12 +93,12 @@ bash train_openpose_coco2017_8card.sh python3 eval.py --model_path /home/openpose_train_8gpu_ckpt/0-80_663.ckpt --imgpath_val coco2017/val2017 --ann coco2017/annotations/person_keypoints_val2017.json ``` -## Results +## Model Results | GPUS | AP | AP  .5 | AR | AR  .5 | -| ---------- | ------ | ------- | ------ | ------- | -| BI V100×8 | 0.3979 | 0.6654 | 0.4435 | 0.6889 | +|------------|--------|--------|--------|--------| +| BI-V100 ×8 | 0.3979 | 0.6654 | 0.4435 | 0.6889 | -## Reference +## References -[Openpose](https://gitee.com/mindspore/models/tree/master/official/cv/OpenPose) +- [Openpose](https://gitee.com/mindspore/models/tree/master/official/cv/OpenPose) diff --git a/cv/self_supervised_learning/mae/pytorch/README.md b/cv/self_supervised_learning/mae/pytorch/README.md index ff57584ad..685f2d1b2 100644 --- a/cv/self_supervised_learning/mae/pytorch/README.md +++ b/cv/self_supervised_learning/mae/pytorch/README.md @@ -1,43 +1,50 @@ -# MAE-pytorch +# MAE -## Model description -This repository is built upon BEiT, an unofficial PyTorch implementation of Masked Autoencoders Are Scalable Vision Learners. We implement the pretrain and finetune process according to the paper, but still can't guarantee the performance reported in the paper can be reproduced! +## Model Description -## Environment +MAE (Masked Autoencoders) is a self-supervised learning model for computer vision that learns powerful representations +by reconstructing randomly masked portions of input images. It employs an asymmetric encoder-decoder architecture where +the encoder processes only the visible patches, and the lightweight decoder reconstructs the original image from the +latent representation and mask tokens. MAE demonstrates that high masking ratios (e.g., 75%) can lead to robust feature +learning, making it scalable and effective for various downstream vision tasks. -``` -pip3 install -r requirements.txt -mkdir -p /home/datasets/cv/ImageNet_ILSVRC2012 -mkdir -p pretrain -mkdir -p output -``` +## Model Preparation -## Download dataset +### Prepare Resources -``` +Download dataset. + +```bash cd /home/datasets/cv/ImageNet_ILSVRC2012 Download the [ImageNet Dataset](https://www.image-net.org/download.php) ``` -## Download pretrain weight +Download pretrain weight -``` +```bash cd pretrain Download the [pretrain_mae_vit_base_mask_0.75_400e.pth](https://drive.google.com/drive/folders/182F5SLwJnGVngkzguTelja4PztYLTXfa) ``` -## Finetune +### Install Dependencies +```bash +pip3 install -r requirements.txt +mkdir -p /home/datasets/cv/ImageNet_ILSVRC2012 +mkdir -p pretrain +mkdir -p output ``` + +## Model Training + +```bash +# Finetune cd .. 
bash run.sh ``` -## Results on BI-V100 - -``` -| GPUs | FPS | Train Epochs | Accuracy | -|------|-----|--------------|------| -| 1x8 | 1233 | 100 | 82.9% | -``` +## Model Results +| GPU | FPS | Train Epochs | Accuracy | +|------------|------|--------------|----------| +| BI-V100 x8 | 1233 | 100 | 82.9% | -- Gitee From 660d96efcb1c8b7e767d92eac7172723f7040f51 Mon Sep 17 00:00:00 2001 From: "mingjiang.li" Date: Mon, 10 Mar 2025 15:48:51 +0800 Subject: [PATCH 2/9] unify model readme format - cv/ocr --- cv/ocr/README.md | 1 - cv/ocr/crnn/mindspore/README.md | 90 +++++++++++-------- cv/ocr/crnn/paddlepaddle/README.md | 34 ++++--- cv/ocr/dbnet/pytorch/README.md | 80 ++++++++++------- .../pytorch/configs/textdet/dbnet/README.md | 2 +- cv/ocr/dbnetpp/paddlepaddle/README.md | 61 +++++++------ cv/ocr/dbnetpp/pytorch/README.md | 39 ++++---- cv/ocr/pp-ocr-db/paddlepaddle/README.md | 29 ++++-- cv/ocr/pp-ocr-east/paddlepaddle/README.md | 44 +++++---- cv/ocr/pse/paddlepaddle/README.md | 30 ++++--- cv/ocr/sar/pytorch/README.md | 49 +++++----- cv/ocr/sast/paddlepaddle/README.md | 44 +++++---- cv/ocr/satrn/pytorch/base/README.md | 78 ++++++++-------- cv/point_cloud/point-bert/pytorch/README.md | 76 +++++++++------- 14 files changed, 376 insertions(+), 281 deletions(-) delete mode 100644 cv/ocr/README.md diff --git a/cv/ocr/README.md b/cv/ocr/README.md deleted file mode 100644 index c92aa3bdd..000000000 --- a/cv/ocr/README.md +++ /dev/null @@ -1 +0,0 @@ -# Optical Character Recognition diff --git a/cv/ocr/crnn/mindspore/README.md b/cv/ocr/crnn/mindspore/README.md index 7ddd2025b..649384425 100644 --- a/cv/ocr/crnn/mindspore/README.md +++ b/cv/ocr/crnn/mindspore/README.md @@ -1,12 +1,51 @@ # CRNN -## Model description +## Model Description -CRNN was a neural network for image based sequence recognition and its Application to scene text recognition.In this paper, we investigate the problem of scene text recognition, which is among the most important and challenging tasks in image-based sequence recognition. A novel neural network architecture, which integrates feature extraction, sequence modeling and transcription into a unified framework, is proposed. Compared with previous systems for scene text recognition, the proposed architecture possesses four distinctive properties: (1) It is end-to-end trainable, in contrast to most of the existing algorithms whose components are separately trained and tuned. (2) It naturally handles sequences in arbitrary lengths, involving no character segmentation or horizontal scale normalization. (3) It is not confined to any predefined lexicon and achieves remarkable performances in both lexicon-free and lexicon-based scene text recognition tasks. (4) It generates an effective yet much smaller model, which is more practical for real-world application scenarios. +CRNN (Convolutional Recurrent Neural Network) is an end-to-end trainable model for image-based sequence recognition, +particularly effective for scene text recognition. It combines convolutional layers for feature extraction with +recurrent layers for sequence modeling, followed by a transcription layer. CRNN handles sequences of arbitrary lengths +without character segmentation or horizontal scaling, making it versatile for both lexicon-free and lexicon-based text +recognition tasks. Its compact architecture and unified framework make it practical for real-world applications like +document analysis and OCR. 
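+
+The short sketch below only illustrates the pipeline described above (convolutional feature extraction, a recurrent
+sequence model, then per-timestep class scores for CTC-style transcription). It is written in PyTorch purely for
+illustration and is not the MindSpore implementation trained in this README; all sizes and names are assumptions.
+
+```python
+import torch
+import torch.nn as nn
+
+
+class TinyCRNN(nn.Module):
+    """Illustrative CRNN: CNN features -> BiLSTM sequence modeling -> per-step logits for a CTC loss."""
+
+    def __init__(self, num_classes=37, img_h=32):
+        super().__init__()
+        self.cnn = nn.Sequential(
+            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
+            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
+        )
+        self.rnn = nn.LSTM(128 * (img_h // 4), 256, num_layers=2, bidirectional=True, batch_first=True)
+        self.fc = nn.Linear(2 * 256, num_classes)   # num_classes includes the CTC "blank" symbol
+
+    def forward(self, images):                       # images: (N, 1, 32, W), arbitrary width W
+        feats = self.cnn(images)                     # (N, 128, 8, W // 4)
+        n, c, h, w = feats.shape
+        seq = feats.permute(0, 3, 1, 2).reshape(n, w, c * h)  # one feature vector per horizontal position
+        seq, _ = self.rnn(seq)
+        return self.fc(seq).log_softmax(-1)          # (N, W // 4, num_classes), ready for nn.CTCLoss
+
+
+logits = TinyCRNN()(torch.randn(2, 1, 32, 100))
+print(logits.shape)  # torch.Size([2, 25, 37])
+```
+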
-[Paper](https://arxiv.org/abs/1507.05717): Baoguang Shi, Xiang Bai, Cong Yao, "An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition", ArXiv, vol. abs/1507.05717, 2015. +## Model Preparation -## Step 1:Installation +### Prepare Resources + +- Go to visit [Syn90K official website](https://www.robots.ox.ac.uk/~vgg/data/text/), then download the dataset for + training. The dataset path structure sholud look like: + +```bash + ├── Syn90k + │ ├── shuffle_labels.txt + │ ├── label.txt + │ ├── label.lmdb + │ ├── mnt +``` + +- Go to visit [IIIT5K official + website](https://cvit.iiit.ac.in/research/projects/cvit-projects/the-iiit-5k-word-dataset), then download the dataset + for test. The dataset path structure sholud look like: + +```bash + ├── IIIT5K + │ ├── traindata.mat + │ ├── testdata.mat + │ ├── trainCharBound.mat + │ ├── testCharBound.mat + │ ├── lexicon.txt + │ ├── train + │ ├── test +``` + +- The annotation need to be extracted from the matlib data file. + +```bash +python3 convert_iiit5k.py -m ./IIIT5K/testdata.mat -o ./IIIT5K -a ./IIIT5K/annotation.txt +``` + +### Install Dependencies ```shell # Install requirements @@ -21,34 +60,7 @@ make -j4 && make install export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib/ ``` -## Step 2:Preparing datasets - -* Go to visit [Syn90K official website](https://www.robots.ox.ac.uk/~vgg/data/text/), then download the dataset for training. The dataset path structure sholud look like: - - ``` - ├── Syn90k - │ ├── shuffle_labels.txt - │ ├── label.txt - │ ├── label.lmdb - │ ├── mnt - ``` -* Go to visit [IIIT5K official website](https://cvit.iiit.ac.in/research/projects/cvit-projects/the-iiit-5k-word-dataset), then download the dataset for test. The dataset path structure sholud look like: - -``` -├── IIIT5K -│ ├── traindata.mat -│ ├── testdata.mat -│ ├── trainCharBound.mat -│ ├── testCharBound.mat -│ ├── lexicon.txt -│ ├── train -│ ├── test -``` -* The annotation need to be extracted from the matlib data file. -``` -python3 convert_iiit5k.py -m ./IIIT5K/testdata.mat -o ./IIIT5K -a ./IIIT5K/annotation.txt -``` -## Step 3:Training +## Model Training ```bash # Run on 1 GPU @@ -60,17 +72,17 @@ python3 train.py --train_dataset=synth --train_dataset_path=./Syn90k/mnt/ramdisk # Run eval python3 eval.py --eval_dataset=iiit5k \ ---eval_dataset_path=./IIIT5K/ \ +--eval_dataset_path=./IIIT5K/ \ --checkpoint_path=./ckpt_0/crnn-10_14110.ckpt \ --device_target=GPU 2>&1 | tee eval.log ``` -## Results +## Model Results -| GPUS | DATASETS | ACC | FPS | -| ---------- | ---------- | ------ | ------ | -| BI-V100 x8 | IIIT5K | 0.798 | 7976.44 | +| GPUS | DATASETS | ACC | FPS | +|------------|----------|-------|---------| +| BI-V100 x8 | IIIT5K | 0.798 | 7976.44 | -## Reference +## References -[CRNN](https://gitee.com/mindspore/models/tree/master/official/cv/CRNN) +- [CRNN](https://gitee.com/mindspore/models/tree/master/official/cv/CRNN) diff --git a/cv/ocr/crnn/paddlepaddle/README.md b/cv/ocr/crnn/paddlepaddle/README.md index 43f568a35..6c9b8b167 100644 --- a/cv/ocr/crnn/paddlepaddle/README.md +++ b/cv/ocr/crnn/paddlepaddle/README.md @@ -1,28 +1,40 @@ # CRNN +## Model Description -## Step 1: Installing -``` +CRNN (Convolutional Recurrent Neural Network) is an end-to-end trainable model for image-based sequence recognition, +particularly effective for scene text recognition. It combines convolutional layers for feature extraction with +recurrent layers for sequence modeling, followed by a transcription layer. 
CRNN handles sequences of arbitrary lengths +without character segmentation or horizontal scaling, making it versatile for both lexicon-free and lexicon-based text +recognition tasks. Its compact architecture and unified framework make it practical for real-world applications like +document analysis and OCR. + +## Model Preparation + +### Prepare Resources + +Download [data_lmdb_release](https://github.com/clovaai/deep-text-recognition-benchmark#download-lmdb-dataset-for-traininig-and-evaluation-from-here). + +### Install Dependencies + +```bash git clone https://github.com/PaddlePaddle/PaddleOCR.git -``` -``` -cd PaddleOCR +cd PaddleOCR/ pip3 install -r requirements.txt ``` -## Step 2: Prepare Datasets -Download [data_lmdb_release](https://github.com/clovaai/deep-text-recognition-benchmark#download-lmdb-dataset-for-traininig-and-evaluation-from-here). +## Model Training -## Step 3: Training Notice: modify configs/rec/rec_mv3_none_bilstm_ctc.yml file, modify the datasets path as yours. -``` -cd PaddleOCR + +```bash export FLAGS_cudnn_exhaustive_search=True export FLAGS_cudnn_batchnorm_spatial_persistent=True export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python3 -u -m paddle.distributed.launch --gpus 0,1,2,3,4,5,6,7 tools/train.py -c configs/rec/rec_mv3_none_bilstm_ctc.yml Global.use_visualdl=True ``` -## Reference +## References + - [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR) diff --git a/cv/ocr/dbnet/pytorch/README.md b/cv/ocr/dbnet/pytorch/README.md index 6f0bf3b73..8202dc43e 100755 --- a/cv/ocr/dbnet/pytorch/README.md +++ b/cv/ocr/dbnet/pytorch/README.md @@ -1,16 +1,33 @@ # DBNet -## Model description -Recently, segmentation-based methods are quite popular in scene text detection, as the segmentation results can more accurately describe scene text of various shapes such as curve text. However, the post-processing of binarization is essential for segmentation-based detection, which converts probability maps produced by a segmentation method into bounding boxes/regions of text. In this paper, we propose a module named Differentiable Binarization (DB), which can perform the binarization process in a segmentation network. Optimized along with a DB module, a segmentation network can adaptively set the thresholds for binarization, which not only simplifies the post-processing but also enhances the performance of text detection. Based on a simple segmentation network, we validate the performance improvements of DB on five benchmark datasets, which consistently achieves state-of-the-art results, in terms of both detection accuracy and speed. In particular, with a light-weight backbone, the performance improvements by DB are significant so that we can look for an ideal tradeoff between detection accuracy and efficiency. -## Step 2: Preparing datasets +## Model Description + +Recently, segmentation-based methods are quite popular in scene text detection, as the segmentation results can more +accurately describe scene text of various shapes such as curve text. However, the post-processing of binarization is +essential for segmentation-based detection, which converts probability maps produced by a segmentation method into +bounding boxes/regions of text. In this paper, we propose a module named Differentiable Binarization (DB), which can +perform the binarization process in a segmentation network. 
Optimized along with a DB module, a segmentation network can +adaptively set the thresholds for binarization, which not only simplifies the post-processing but also enhances the +performance of text detection. Based on a simple segmentation network, we validate the performance improvements of DB on +five benchmark datasets, which consistently achieves state-of-the-art results, in terms of both detection accuracy and +speed. In particular, with a light-weight backbone, the performance improvements by DB are significant so that we can +look for an ideal tradeoff between detection accuracy and efficiency. + +## Model Preparation + +### Prepare Resources ```shell -$ mkdir data -$ cd data +mkdir data +cd data ``` -ICDAR 2015 -Please [ICDAR 2015](https://rrc.cvc.uab.es/?ch=4&com=downloads) download ICDAR 2015 here -ch4_training_images.zip、ch4_test_images.zip、ch4_training_localization_transcription_gt.zip、Challenge4_Test_Task1_GT.zip + +Download [ICDAR 2015](https://rrc.cvc.uab.es/?ch=4&com=downloads). + +- ch4_training_images.zip +- ch4_test_images.zip +- ch4_training_localization_transcription_gt.zip +- Challenge4_Test_Task1_GT.zip ```shell mkdir icdar2015 && cd icdar2015 @@ -22,49 +39,46 @@ mv ch4_test_images imgs/test mv ch4_training_localization_transcription_gt annotations/training mv Challenge4_Test_Task1_GT annotations/test ``` -Please [instances_training.json](https://download.openmmlab.com/mmocr/data/icdar2015/instances_training.json) download instances_training.json here -Please [instances_test.json](https://download.openmmlab.com/mmocr/data/icdar2015/instances_test.json) download instances_test.json here -```shell +Download [instances_training.json](https://download.openmmlab.com/mmocr/data/icdar2015/instances_training.json). +Download [instances_test.json](https://download.openmmlab.com/mmocr/data/icdar2015/instances_test.json). 
+
+```shell
 icdar2015/
 ├── imgs
 │   ├── test
 │   └── training
 ├── instances_test.json
 └── instances_training.json
-
 ```
-### Build Extension
-```shell
-$ DBNET_CV_WITH_OPS=1 python3 setup.py build && cp build/lib.linux*/dbnet_cv/_ext.cpython* dbnet_cv
-```
-### Install packages
+### Install Dependencies
 
 ```shell
-$ pip3 install -r requirements.txt
-```
+# Build Extension
+DBNET_CV_WITH_OPS=1 python3 setup.py build && cp build/lib.linux*/dbnet_cv/_ext.cpython* dbnet_cv
 
-### Training on single card
-```shell
-$ python3 train.py configs/textdet/dbnet/dbnet_mobilenetv3_fpnc_1200e_icdar2015.py
+# Install packages
+pip3 install -r requirements.txt
 ```
 
-### Training on mutil-cards
+## Model Training
+
 ```shell
-$ bash dist_train.sh configs/textdet/dbnet/dbnet_mobilenetv3_fpnc_1200e_icdar2015.py 8
+# Training on single card
+python3 train.py configs/textdet/dbnet/dbnet_mobilenetv3_fpnc_1200e_icdar2015.py
+
+# Training on multi-cards
+bash dist_train.sh configs/textdet/dbnet/dbnet_mobilenetv3_fpnc_1200e_icdar2015.py 8
 ```
 
-## Results on BI-V100
+## Model Results
 
-| approach| GPUs | train mem | train FPS |
-| :-----: |:-------:| :-------: |:--------: |
-| dbnet | BI100x8 | 5426 | 54.375 |
+| Model | GPUs       | train mem | train FPS | 0_hmean-iou:recall | 0_hmean-iou:precision | 0_hmean-iou:hmean |
+|-------|------------|-----------|-----------|--------------------|-----------------------|-------------------|
+| DBNet | BI-V100 x8 | 5426      | 54.375    | 0.7111             | 0.8062                | 0.7557            |
 
-|0_hmean-iou:recall: | 0_hmean-iou:precision: | 0_hmean-iou:hmean:|
-| :-----: | :-------: | :-------: |
-| 0.7111 | 0.8062 | 0.7557 |
+## References
 
-## Reference
-https://github.com/open-mmlab/mmocr
+- [mmocr](https://github.com/open-mmlab/mmocr)
diff --git a/cv/ocr/dbnet/pytorch/configs/textdet/dbnet/README.md b/cv/ocr/dbnet/pytorch/configs/textdet/dbnet/README.md
index d2007c72e..ac7a1be78 100755
--- a/cv/ocr/dbnet/pytorch/configs/textdet/dbnet/README.md
+++ b/cv/ocr/dbnet/pytorch/configs/textdet/dbnet/README.md
@@ -12,7 +12,7 @@ Recently, segmentation-based methods are quite popular in scene text detection,
 
 
 
-## Results and models
+## Model Results
 
 ### ICDAR2015
 
diff --git a/cv/ocr/dbnetpp/paddlepaddle/README.md b/cv/ocr/dbnetpp/paddlepaddle/README.md
index fec54ed01..ad038cda1 100644
--- a/cv/ocr/dbnetpp/paddlepaddle/README.md
+++ b/cv/ocr/dbnetpp/paddlepaddle/README.md
@@ -1,33 +1,22 @@
 # DBNet++
 
-## Model description
+## Model Description
 
-Recently, segmentation-based scene text detection methods have drawn extensive attention in the scene text detection field, because of their superiority in detecting the text instances of arbitrary shapes and extreme aspect ratios, profiting from the pixel-level descriptions. However, the vast majority of the existing segmentation-based approaches are limited to their complex post-processing algorithms and the scale robustness of their segmentation models, where the post-processing algorithms are not only isolated to the model optimization but also time-consuming and the scale robustness is usually strengthened by fusing multi-scale feature maps directly. In this paper, we propose a Differentiable Binarization (DB) module that integrates the binarization process, one of the most important steps in the post-processing procedure, into a segmentation network. Optimized along with the proposed DB module, the segmentation network can produce more accurate results, which enhances the accuracy of text detection with a simple pipeline.
Furthermore, an efficient Adaptive Scale Fusion (ASF) module is proposed to improve the scale robustness by fusing features of different scales adaptively. By incorporating the proposed DB and ASF with the segmentation network, our proposed scene text detector consistently achieves state-of-the-art results, in terms of both detection accuracy and speed, on five standard benchmarks. +DBNet++ is an advanced scene text detection model that combines a Differentiable Binarization (DB) module with an +Adaptive Scale Fusion (ASF) mechanism. The DB module integrates binarization directly into the segmentation network, +simplifying post-processing and improving accuracy. The ASF module enhances scale robustness by adaptively fusing +multi-scale features. This architecture enables DBNet++ to detect text of arbitrary shapes and extreme aspect ratios +efficiently, achieving state-of-the-art performance in both accuracy and speed across various text detection benchmarks. -## Step 1: Installation +## Model Preparation -```bash -# Clone PaddleOCR, branch: release/2.5 -git clone -b release/2.5 https://github.com/PaddlePaddle/PaddleOCR.git - -# Copy PaddleOCR 2.5 patch from toolbox -yes | cp -rf ../../../../toolbox/PaddleOCR/* PaddleOCR/ -cd PaddleOCR - -# install requirements. -bash ../init.sh - -# build PaddleOCR -python3 setup.py develop -``` +### Prepare Resources -## Step 2: Preparing datasets - -Download the [ICDAR2015 Dataset](https://deepai.org/dataset/icdar-2015) +Download [ICDAR 2015](https://deepai.org/dataset/icdar-2015) Dataset. ```bash # ICDAR2015 PATH as follow: -ls -al /home/datasets/ICDAR2015/text_localization +$ ls -al /home/datasets/ICDAR2015/text_localization total 133420 drwxr-xr-x 4 root root 179 Jul 21 15:54 . drwxr-xr-x 3 root root 39 Jul 21 15:50 .. @@ -46,7 +35,24 @@ ln -s /path/to/icdar2015/ train_data/icdar2015 wget -P ./pretrain_models/ https://paddleocr.bj.bcebos.com/pretrained/MobileNetV3_large_x0_5_pretrained.pdparams ``` -## Step 3: Training +### Install Dependencies + +```bash +# Clone PaddleOCR, branch: release/2.5 +git clone -b release/2.5 https://github.com/PaddlePaddle/PaddleOCR.git + +# Copy PaddleOCR 2.5 patch from toolbox +yes | cp -rf ../../../../toolbox/PaddleOCR/* PaddleOCR/ +cd PaddleOCR + +# install requirements. +bash ../init.sh + +# build PaddleOCR +python3 setup.py develop +``` + +## Model Training ```bash # run training @@ -59,13 +65,12 @@ python3 -m paddle.distributed.launch --gpus $CUDA_VISIBLE_DEVICES \ -o Global.pretrained_model=./pretrain_models/MobileNetV3_large_x0_5_pretrained ``` -## Results - +## Model Results -| GPUs | IPS | ACC | -| ------------ | ---------------- | ------------------- | -| BI-V100 x8 | 5.46 samples/s | precision: 0.9062 | +| Model | GPUs | IPS | ACC | +|---------|------------|----------------|-------------------| +| DBNet++ | BI-V100 x8 | 5.46 samples/s | precision: 0.9062 | -## Reference +## References - [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR.git) diff --git a/cv/ocr/dbnetpp/pytorch/README.md b/cv/ocr/dbnetpp/pytorch/README.md index 244dc8284..ece02b7e8 100644 --- a/cv/ocr/dbnetpp/pytorch/README.md +++ b/cv/ocr/dbnetpp/pytorch/README.md @@ -1,10 +1,23 @@ # DBNet++ -## Model description +## Model Description -Recently, segmentation-based scene text detection methods have drawn extensive attention in the scene text detection field, because of their superiority in detecting the text instances of arbitrary shapes and extreme aspect ratios, profiting from the pixel-level descriptions. 
However, the vast majority of the existing segmentation-based approaches are limited to their complex post-processing algorithms and the scale robustness of their segmentation models, where the post-processing algorithms are not only isolated to the model optimization but also time-consuming and the scale robustness is usually strengthened by fusing multi-scale feature maps directly. In this paper, we propose a Differentiable Binarization (DB) module that integrates the binarization process, one of the most important steps in the post-processing procedure, into a segmentation network. Optimized along with the proposed DB module, the segmentation network can produce more accurate results, which enhances the accuracy of text detection with a simple pipeline. Furthermore, an efficient Adaptive Scale Fusion (ASF) module is proposed to improve the scale robustness by fusing features of different scales adaptively. By incorporating the proposed DB and ASF with the segmentation network, our proposed scene text detector consistently achieves state-of-the-art results, in terms of both detection accuracy and speed, on five standard benchmarks. +DBNet++ is an advanced scene text detection model that combines a Differentiable Binarization (DB) module with an +Adaptive Scale Fusion (ASF) mechanism. The DB module integrates binarization directly into the segmentation network, +simplifying post-processing and improving accuracy. The ASF module enhances scale robustness by adaptively fusing +multi-scale features. This architecture enables DBNet++ to detect text of arbitrary shapes and extreme aspect ratios +efficiently, achieving state-of-the-art performance in both accuracy and speed across various text detection benchmarks. -## Step 1: Installation +## Model Preparation + +### Prepare Resources + +```bash +mkdir data +python3 tools/dataset_converters/prepare_dataset.py icdar2015 --task textdet +``` + +### Install Dependencies ```bash # Install libGL @@ -26,14 +39,7 @@ mkdir -p /root/.cache/torch/hub/checkpoints/ wget https://download.pytorch.org/models/resnet50-0676ba61.pth -O /root/.cache/torch/hub/checkpoints/resnet50-0676ba61.pth ``` -## Step 2: Preparing datasets - -```bash -mkdir data -python3 tools/dataset_converters/prepare_dataset.py icdar2015 --task textdet -``` - -## Step 3: Training +## Model Training ```bash sed -i 's/val_interval=20/val_interval=1200/g' configs/textdet/_base_/schedules/schedule_sgd_1200e.py @@ -47,12 +53,13 @@ python3 tools/train.py configs/textdet/dbnetpp/dbnetpp_resnet50_fpnc_1200e_icdar # Multiple GPUs on one machine bash tools/dist_train.sh configs/textdet/dbnetpp/dbnetpp_resnet50_fpnc_1200e_icdar2015.py 8 ``` -## Results -| GPUs | Precision | Recall | Hmean | -| ---------- | --------- | ------ | ----- | -| BI-V100 x8 | 0.8823 | 0.8156 | 0.8476 | +## Model Results + +| Model | GPU | Precision | Recall | Hmean | +|---------|------------|-----------|--------|--------| +| DBNet++ | BI-V100 x8 | 0.8823 | 0.8156 | 0.8476 | -## Reference +## References - [mmocr](https://github.com/open-mmlab/mmocr/tree/v1.0.1/configs/textdet/dbnetpp) diff --git a/cv/ocr/pp-ocr-db/paddlepaddle/README.md b/cv/ocr/pp-ocr-db/paddlepaddle/README.md index 15ecf8377..0d1b59ca9 100644 --- a/cv/ocr/pp-ocr-db/paddlepaddle/README.md +++ b/cv/ocr/pp-ocr-db/paddlepaddle/README.md @@ -1,15 +1,18 @@ # PP-OCR-DB -## Step 1: Installing -```bash -git clone --recursive https://github.com/PaddlePaddle/PaddleOCR.git -cd PaddleOCR -pip3 install -r requirements.txt -``` +## Model Description + +PP-OCR-DB is an 
efficient deep learning model for text detection, part of the PaddleOCR framework. It combines a +MobileNetV3 backbone with a Differentiable Binarization (DB) module to accurately detect text in various scenarios. The +model is optimized for real-time performance and can handle diverse text layouts and orientations. PP-OCR-DB is +particularly effective in document analysis and scene text recognition tasks, offering a balance between accuracy and +computational efficiency for practical OCR applications. + +## Model Preparation -## Step 2: Download data +### Prepare Resources -Download the [ICDAR2015 Dataset](https://deepai.org/dataset/icdar-2015) +Download the [ICDAR2015 Dataset](https://deepai.org/dataset/icdar-2015) ```bash # ICDAR2015 PATH as follow: @@ -26,7 +29,15 @@ drwxr-xr-x 2 root root 24576 Jul 21 15:53 icdar_c4_train_imgs ``` -## Step 3: Run PP-OCR-DB +### Install Dependencies + +```bash +git clone --recursive https://github.com/PaddlePaddle/PaddleOCR.git +cd PaddleOCR +pip3 install -r requirements.txt +``` + +## Model Training ```bash # Notice: modify "configs/det/det_mv3_db.yml" file, set the datasets path as yours. diff --git a/cv/ocr/pp-ocr-east/paddlepaddle/README.md b/cv/ocr/pp-ocr-east/paddlepaddle/README.md index a3acde4f4..d842f8308 100644 --- a/cv/ocr/pp-ocr-east/paddlepaddle/README.md +++ b/cv/ocr/pp-ocr-east/paddlepaddle/README.md @@ -1,23 +1,22 @@ # PP-OCR-EAST -## Model description -EAST (Efficient and Accurate Scene Text Detector) is a deep learning model designed for detecting and recognizing text in natural scene images. -It was developed by researchers at the School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, and was presented in a research paper in 2017. +## Model Description -## Step 1: Installation -```bash -git clone --recursive https://github.com/PaddlePaddle/PaddleOCR.git -cd PaddleOCR -pip3 install -r requirements.txt -``` +PP-OCR-EAST is an efficient scene text detection model based on the EAST architecture, optimized within the PaddleOCR +framework. It combines a MobileNetV3 backbone with the EAST detection mechanism to accurately locate text in natural +scene images. The model is designed for real-time performance and can handle text of various orientations and sizes. +PP-OCR-EAST is particularly effective in complex scenarios, offering a balance between detection accuracy and +computational efficiency for practical OCR applications. + +## Model Preparation -## Step 2: Preparing datasets +### Prepare Resources -Download the [ICDAR2015 Dataset](https://deepai.org/dataset/icdar-2015) +Download the [ICDAR2015 Dataset](https://deepai.org/dataset/icdar-2015) ```bash # ICDAR2015 PATH as follow: -ls -al /home/datasets/ICDAR2015/text_localization +$ ls -al /home/datasets/ICDAR2015/text_localization total 133420 drwxr-xr-x 4 root root 179 Jul 21 15:54 . drwxr-xr-x 3 root root 39 Jul 21 15:50 .. @@ -30,7 +29,15 @@ drwxr-xr-x 2 root root 24576 Jul 21 15:53 icdar_c4_train_imgs ``` -## Step 3: Training +### Install Dependencies + +```bash +git clone --recursive https://github.com/PaddlePaddle/PaddleOCR.git +cd PaddleOCR +pip3 install -r requirements.txt +``` + +## Model Training ```bash # Notice: modify "configs/det/det_mv3_east.yml" file, set the datasets path as yours. 
@@ -41,11 +48,12 @@ export CUDA_VISIBLE_DEVICES=0,1,2,3 python3 -u -m paddle.distributed.launch --gpus '0,1,2,3' tools/train.py -c configs/det/det_mv3_east.yml -o Global.use_visualdl=True ``` -## Results +## Model Results + +| Model | GPU | FPS | ACC | +|-------------|------------|-------|---------------------------------| +| PP-OCR-EAST | BI-V100 x8 | 50.08 | hmean:0.7711, precision: 0.7752 | -GPUs|FPS|ACC -----|---|--- -BI-V100 x8|50.08|hmean:0.7711, precision: 0.7752 +## References -## Reference - [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR.git) diff --git a/cv/ocr/pse/paddlepaddle/README.md b/cv/ocr/pse/paddlepaddle/README.md index 18e6078e3..413c7f087 100644 --- a/cv/ocr/pse/paddlepaddle/README.md +++ b/cv/ocr/pse/paddlepaddle/README.md @@ -1,21 +1,20 @@ # PSE -## Model description -[Shape robust text detection with progressive scale expansion network](https://arxiv.org/abs/1903.12473) Wang, Wenhai and Xie, Enze and Li, Xiang and Hou, Wenbo and Lu, Tong and Yu, Gang and Shao, Shuai CVPR, 2019 +## Model Description -## Step 1: Installing -```bash -git clone --recursive https://github.com/PaddlePaddle/PaddleOCR.git -cd PaddleOCR -pip3 install -r requirements.txt -``` +PSE (Progressive Scale Expansion Network) is a deep learning model for robust text detection in natural scenes. It +addresses the challenge of detecting text with arbitrary shapes by progressively expanding text regions through a scale +expansion algorithm. PSE effectively handles complex scenarios like curved text and overlapping instances. The model's +architecture combines feature pyramid networks with a novel post-processing method, making it particularly suitable for +detecting text in diverse orientations and layouts with high accuracy. -## Step 2: Download data +## Model Preparation -Download the [ICDAR2015 Dataset](https://deepai.org/dataset/icdar-2015) +### Prepare Resources -```bash +Download the [ICDAR2015 Dataset](https://deepai.org/dataset/icdar-2015) +```bash # ICDAR2015 PATH as follow: ls -al /home/datasets/ICDAR2015/text_localization total 133420 @@ -27,10 +26,17 @@ drwxr-xr-x 2 root root 12288 Jul 21 15:53 ch4_test_images drwxr-xr-x 2 root root 24576 Jul 21 15:53 icdar_c4_train_imgs -rw-r--r-- 1 root root 468453 Jul 21 15:54 test_icdar2015_label.txt -rw-r--r-- 1 root root 1063118 Jul 21 15:54 train_icdar2015_label.txt +``` + +### Install Dependencies +```bash +git clone --recursive https://github.com/PaddlePaddle/PaddleOCR.git +cd PaddleOCR +pip3 install -r requirements.txt ``` -## Step 3: Run PSE +## Model Training ```bash # Notice: modify "configs/det/det_r50_vd_pse.yml" file, set the datasets path as yours. diff --git a/cv/ocr/sar/pytorch/README.md b/cv/ocr/sar/pytorch/README.md index 06a7dee28..cee369f63 100755 --- a/cv/ocr/sar/pytorch/README.md +++ b/cv/ocr/sar/pytorch/README.md @@ -1,28 +1,26 @@ # SAR -## Model description +## Model Description -Recognizing irregular text in natural scene images is challenging due to the large variance in text appearance, such as curvature, orientation and distortion. Most existing approaches rely heavily on sophisticated model designs and/or extra fine-grained annotations, which, to some extent, increase the difficulty in algorithm implementation and data collection. In this work, we propose an easy-to-implement strong baseline for irregular scene text recognition, using off-the-shelf neural network components and only word-level annotations. It is composed of a 31-layer ResNet, an LSTM-based encoder-decoder framework and a 2-dimensional attention module. 
Despite its simplicity, the proposed method is robust and achieves state-of-the-art performance on both regular and irregular scene text recognition benchmarks. +Recognizing irregular text in natural scene images is challenging due to the large variance in text appearance, such as +curvature, orientation and distortion. Most existing approaches rely heavily on sophisticated model designs and/or extra +fine-grained annotations, which, to some extent, increase the difficulty in algorithm implementation and data +collection. In this work, we propose an easy-to-implement strong baseline for irregular scene text recognition, using +off-the-shelf neural network components and only word-level annotations. It is composed of a 31-layer ResNet, an +LSTM-based encoder-decoder framework and a 2-dimensional attention module. Despite its simplicity, the proposed method +is robust and achieves state-of-the-art performance on both regular and irregular scene text recognition benchmarks. -## Step 1: Installation +## Model Preparation -```shell -cd csrc/ -bash clean.sh -bash build.sh -bash install.sh -cd .. -pip3 install -r requirements.txt -``` - -## Step 2: Preparing datasets +### Prepare Resources ```bash mkdir data cd data ``` -Reffering to [MMOCR Docs](https://mmocr.readthedocs.io/zh_CN/dev-1.x/user_guides/data_prepare/datasetzoo.html) to prepare datasets. Datasets path would look like below: +Reffering to [MMOCR Docs](https://mmocr.readthedocs.io/zh_CN/dev-1.x/user_guides/data_prepare/datasetzoo.html) to +prepare datasets. Datasets path would look like below: ```bash ├── mixture @@ -95,18 +93,27 @@ Reffering to [MMOCR Docs](https://mmocr.readthedocs.io/zh_CN/dev-1.x/user_guides │ │ ├── val_label.txt ``` -## Step 3: Training +### Install Dependencies -### Training on single card -```bash -python3 train.py configs/sar_r31_parallel_decoder_academic.py +```shell +cd csrc/ +bash clean.sh +bash build.sh +bash install.sh +cd .. +pip3 install -r requirements.txt ``` -### Training on mutil-cards +## Model Training + ```bash +# Training on single card +python3 train.py configs/sar_r31_parallel_decoder_academic.py + +# Training on mutil-cards bash dist_train.sh configs/sar_r31_parallel_decoder_academic.py 8 ``` -## Reference -https://github.com/open-mmlab/mmocr +## References +- [mmocr](https://github.com/open-mmlab/mmocr) diff --git a/cv/ocr/sast/paddlepaddle/README.md b/cv/ocr/sast/paddlepaddle/README.md index db4db6670..f98f954b7 100644 --- a/cv/ocr/sast/paddlepaddle/README.md +++ b/cv/ocr/sast/paddlepaddle/README.md @@ -1,20 +1,21 @@ # SAST -## Description +## Model Description -SAST is a cutting-edge segmentation-based text detector designed for recognizing scene text of arbitrary shapes. Leveraging a context attended multi-task learning framework anchored on a Fully Convolutional Network (FCN), it adeptly learns geometric properties to reconstruct text regions into polygonal shapes. Incorporating a Context Attention Block, SAST captures long-range pixel dependencies for improved segmentation accuracy, while its Point-to-Quad assignment method efficiently clusters pixels into text instances by merging high-level and low-level information. Demonstrated to be highly effective across several benchmarks like ICDAR2015 and SCUT-CTW1500, SAST not only shows superior accuracy but also operates efficiently, achieving significant performance metrics such as running at 27.63 FPS on a NVIDIA Titan Xp with a high detection accuracy, making it a notable solution for arbitrary-shaped text detection challenges. 
+SAST is a cutting-edge segmentation-based text detector designed for recognizing scene text of arbitrary shapes.
+Leveraging a context attended multi-task learning framework anchored on a Fully Convolutional Network (FCN), it adeptly
+learns geometric properties to reconstruct text regions into polygonal shapes. Incorporating a Context Attention Block,
+SAST captures long-range pixel dependencies for improved segmentation accuracy, while its Point-to-Quad assignment
+method efficiently clusters pixels into text instances by merging high-level and low-level information. Demonstrated to
+be highly effective across several benchmarks like ICDAR2015 and SCUT-CTW1500, SAST not only shows superior accuracy but
+also operates efficiently, achieving significant performance metrics such as running at 27.63 FPS on an NVIDIA Titan Xp
+with high detection accuracy, making it a notable solution for arbitrary-shaped text detection challenges.

-## Step 1: Installation

+## Model Preparation

-```bash
-git clone -b release/2.7 https://github.com/PaddlePaddle/PaddleOCR.git
-cd PaddleOCR/
-pip3 install -r requirements.txt
-```
-
-## Step 2: Preparing datasets
+### Prepare Resources

-Download the [ICDAR2015 Dataset](https://deepai.org/dataset/icdar-2015)
+Download the [ICDAR2015 Dataset](https://deepai.org/dataset/icdar-2015)

 ```bash
 # ICDAR2015 PATH as follow:
@@ -28,10 +29,17 @@ drwxr-xr-x 2 root root 12288 Jul 21 15:53 ch4_test_images
 drwxr-xr-x 2 root root 24576 Jul 21 15:53 icdar_c4_train_imgs
 -rw-r--r-- 1 root root 468453 Jul 21 15:54 test_icdar2015_label.txt
 -rw-r--r-- 1 root root 1063118 Jul 21 15:54 train_icdar2015_label.txt
+```

+### Install Dependencies
+
+```bash
+git clone -b release/2.7 https://github.com/PaddlePaddle/PaddleOCR.git
+cd PaddleOCR/
+pip3 install -r requirements.txt
 ```

-## Step 3: Training
+## Model Training

 ```bash
 # Notice: modify "configs/det/det_r50_vd_sast_icdar15.yml" file, set the datasets path as yours.
@@ -43,12 +51,12 @@ python3 -u -m paddle.distributed.launch --gpus '0,1,2,3,4,5,6,7' tools/train.py
 >train.log 2>&1 &
 ```

-## Results
+## Model Results

-GPUs|FPS|ACC
-----|---|---
-BI-V100 x8| ips: 11.24631 samples/s | hmean: 0.817155756207675
+| Model | GPU        | FPS                     | ACC                      |
+|-------|------------|-------------------------|--------------------------|
+| SAST  | BI-V100 x8 | ips: 11.24631 samples/s | hmean: 0.817155756207675 |

-## Reference
-- [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR.git)
+## References

+- [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR.git)
diff --git a/cv/ocr/satrn/pytorch/base/README.md b/cv/ocr/satrn/pytorch/base/README.md
index 29ba40015..f3ce80cdd 100755
--- a/cv/ocr/satrn/pytorch/base/README.md
+++ b/cv/ocr/satrn/pytorch/base/README.md
@@ -1,35 +1,25 @@
-# SATRN(Self-Attention Text Recognition Network)
+# SATRN

-## Model description
+## Model Description

-Scene text recognition (STR) is the task of recognizing character sequences in natural scenes. While there have been great advances in STR methods, current methods still fail to recognize texts in arbitrary shapes, such as heavily curved or rotated texts, which are abundant in daily life (e.g. restaurant signs, product labels, company logos, etc). This paper introduces a novel architecture to recognizing texts of arbitrary shapes, named Self-Attention Text Recognition Network (SATRN), which is inspired by the Transformer. SATRN utilizes the self-attention mechanism to describe two-dimensional (2D) spatial dependencies of characters in a scene text image. Exploiting the full-graph propagation of self-attention, SATRN can recognize texts with arbitrary arrangements and large inter-character spacing. As a result, SATRN outperforms existing STR models by a large margin of 5.7 pp on average in "irregular text" benchmarks. We provide empirical analyses that illustrate the inner mechanisms and the extent to which the model is applicable (e.g. rotated and multi-line text). We will open-source the code.
+SATRN (Self-Attention Text Recognition Network) is an advanced deep learning model for scene text recognition,
+particularly effective for texts with arbitrary shapes like curved or rotated characters. Inspired by Transformer
+architecture, SATRN utilizes self-attention mechanisms to capture 2D spatial dependencies between characters. This
+enables it to handle complex text arrangements and large inter-character spacing with high accuracy. SATRN significantly
+outperforms traditional methods in recognizing irregular texts, making it valuable for real-world applications like sign
+and logo recognition.

+## Model Preparation

-## Step 1: Installation
-
-```bash
-# Install libGL
-## CentOS
-yum install -y mesa-libGL
-## Ubuntu
-apt install -y libgl1-mesa-dev
-
-cd /satrn/pytorch/base/csrc
-bash clean.sh
-bash build.sh
-bash install.sh
-cd ..
-pip3 install -r requirements.txt
-```
-
-## Step 2: Preparing datasets
+### Prepare Resources

 ```bash
 mkdir data
 cd data
 ```

-Reffering to [MMOCR Docs](https://mmocr.readthedocs.io/zh_CN/dev-1.x/user_guides/data_prepare/datasetzoo.html) to prepare datasets. Datasets path would look like below:
+Referring to [MMOCR Docs](https://mmocr.readthedocs.io/zh_CN/dev-1.x/user_guides/data_prepare/datasetzoo.html) to
+prepare datasets. The dataset path would look like below:

 ```bash
 ├── mixture
@@ -102,37 +92,43 @@ Reffering to [MMOCR Docs](https://mmocr.readthedocs.io/zh_CN/dev-1.x/user_guides
 │   │   ├── val_label.txt
 ```

-## Step 3: Training
+### Install Dependencies

-### Training on single card
 ```bash
-python3 train.py configs/models/satrn_academic.py
+# Install libGL
+## CentOS
+yum install -y mesa-libGL
+## Ubuntu
+apt install -y libgl1-mesa-dev
+
+cd /satrn/pytorch/base/csrc
+bash clean.sh
+bash build.sh
+bash install.sh
+cd ..
+pip3 install -r requirements.txt
 ```

-### Training on mutil-cards
+## Model Training
+
 ```bash
+# Training on single card
+python3 train.py configs/models/satrn_academic.py
+
+# Training on multiple cards
 bash dist_train.sh configs/models/satrn_academic.py 8
 ```

-## Results on BI-V100
+## Model Results

-| approach| GPUs   | train mem | train FPS |
-| :-----: |:-------:| :-------: |:--------: |
-| satrn   | BI100x8 | 14.159G   | 549.94    |
-
-| dataset | acc     |
-| :-----: |:-------:|
-| IIIT5K  | 94.5    |
-| IC15    | 83.3    |
-| SVTP    | 88.4    |
+| Model | GPU        | train mem | train FPS | ACC                                  |
+|-------|------------|-----------|-----------|--------------------------------------|
+| SATRN | BI-V100 x8 | 14.159G   | 549.94    | IIIT5K: 94.5, IC15: 83.3, SVTP: 88.4 |

 | Convergence criteria | Configuration (x denotes number of GPUs) | Performance | Accuracy | Power(W) | Scalability | Memory utilization(G) | Stability |
 |----------------------|------------------------------------------|-------------|----------|------------|-------------|-------------------------|-----------|
 | 0.841                | SDK V2.2,bs:128,8x,fp32                  | 630         | 88.4     | 166\*8     | 0.98        | 28.5\*8                 | 1         |

+## References

-## Reference
-https://github.com/open-mmlab/mmocr
-
-
-
+- [mmocr](https://github.com/open-mmlab/mmocr)
diff --git a/cv/point_cloud/point-bert/pytorch/README.md b/cv/point_cloud/point-bert/pytorch/README.md
index dc7d729b9..71231e2c1 100644
--- a/cv/point_cloud/point-bert/pytorch/README.md
+++ b/cv/point_cloud/point-bert/pytorch/README.md
@@ -1,33 +1,63 @@
 # Point-BERT

-Point-BERT is a new paradigm for learning Transformers to generalize the concept of BERT onto 3D point cloud. Inspired by BERT, we devise a Masked Point Modeling (MPM) task to pre-train point cloud Transformers. Specifically, we first divide a point cloud into several local patches, and a point cloud Tokenizer is devised via a discrete Variational AutoEncoder (dVAE) to generate discrete point tokens containing meaningful local information. Then, we randomly mask some patches of input point clouds and feed them into the backbone Transformer. The pre-training objective is to recover the original point tokens at the masked locations under the supervision of point tokens obtained by the Tokenizer.
+## Model Description

-## Step 1: Installing packages
+Point-BERT is a new paradigm for learning Transformers to generalize the concept of BERT onto 3D point cloud. Inspired
+by BERT, we devise a Masked Point Modeling (MPM) task to pre-train point cloud Transformers. Specifically, we first
+divide a point cloud into several local patches, and a point cloud Tokenizer is devised via a discrete Variational
+AutoEncoder (dVAE) to generate discrete point tokens containing meaningful local information. Then, we randomly mask
+some patches of input point clouds and feed them into the backbone Transformer. The pre-training objective is to recover
+the original point tokens at the masked locations under the supervision of point tokens obtained by the Tokenizer.
+
+## Model Preparation
+
+### Prepare Resources
+
+Please refer to [DATASET.md](./DATASET.md) for preparing `ShapeNet55` and `processed ModelNet`.
+The dataset directory tree would be like:
+
+```bash
+data/
+├── ModelNet
+│   └── modelnet40_normal_resampled
+│      ├── modelnet40_test_8192pts_fps.dat
+│      └── modelnet40_train_8192pts_fps.dat
+├── ScanObjectNN_shape_names.txt
+├── ShapeNet55-34
+│   ├── ShapeNet-55
+│   │   ├── test.txt
+│   │   └── train.txt
+│   └── shapenet_pc
+└── shapenet_synset_dict.json
+```
+
+### Install Dependencies

 > Warning: Now only support Ubuntu OS. If your OS is centOS, you may need to compile open3d from source.

 * system

-```sh
+```bash
 apt update
 apt install libgl1-mesa-glx
 ```

 * python

-```sh
-pip3 install argparse easydict h5py matplotlib numpy open3d==0.10 opencv-python pyyaml scipy tensorboardX timm==0.4.5 tqdm transforms3d termcolor scikit-learn==0.24.1 Ninja --default-timeout=1000
+```bash
+pip3 install argparse easydict h5py matplotlib numpy open3d==0.10 opencv-python pyyaml scipy tensorboardX timm==0.4.5 \
+    tqdm transforms3d termcolor scikit-learn==0.24.1 Ninja --default-timeout=1000
 ```

 * Chamfer Distance

-```sh
+```bash
 bash install.sh
 ```

 * PointNet++

-```sh
+```bash
 cd ./Pointnet2_PyTorch
 pip3 install pointnet2_ops_lib/.
 cd -
@@ -35,35 +65,15 @@ cd -

 * GPU kNN

-```sh
+```bash
 pip3 install --upgrade https://github.com/unlimblue/KNN_CUDA/releases/download/0.2/KNN_CUDA-0.2-py3-none-any.whl
 ```

-## Step 2: Preparing datasets
-
-Please refer to [DATASET.md](./DATASET.md) for preparing `ShapeNet55` and `processed ModelNet`.
-The dataset dircectory tree would be like:
-
-```sh
-data/
-├── ModelNet
-│   └── modelnet40_normal_resampled
-│      ├── modelnet40_test_8192pts_fps.dat
-│      └── modelnet40_train_8192pts_fps.dat
-├── ScanObjectNN_shape_names.txt
-├── ShapeNet55-34
-│   ├── ShapeNet-55
-│   │   ├── test.txt
-│   │   └── train.txt
-│   └── shapenet_pc
-└── shapenet_synset_dict.json
-```
-
-## Step 3: Training
+## Model Training

 * dVAE train

-```sh
+```bash
 bash scripts/train.sh 0 --config cfgs/ShapeNet55_models/dvae.yaml --exp_name dVAE
 ```

@@ -71,10 +81,10 @@ bash scripts/train.sh 0 --config cfgs/ShapeNet55_models/dvae.yaml --exp_name dVA

 When dVAE has finished training, you should be edit `cfgs/Mixup_models/Point-BERT.yaml`, and add the path of dvae_config-ckpt.
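+
+For reference, the snippet below shows one way to script that edit. It is only a sketch: the `dvae_config`/`ckpt` key
+names and the checkpoint path are assumptions taken from a typical dVAE run, so check the actual
+`cfgs/Mixup_models/Point-BERT.yaml` in your checkout and adjust the paths before using it.
+
+```python
+# Hypothetical helper: point the Point-BERT pre-training config at the trained dVAE weights.
+# Note that yaml.safe_dump rewrites the file and drops any comments it contained.
+import yaml
+
+cfg_path = "cfgs/Mixup_models/Point-BERT.yaml"
+with open(cfg_path) as f:
+    cfg = yaml.safe_load(f)
+
+# Assumed location of the checkpoint written by the dVAE run above (exp_name "dVAE").
+cfg["dvae_config"]["ckpt"] = "experiments/dvae/ShapeNet55_models/dVAE/ckpt-best.pth"
+
+with open(cfg_path, "w") as f:
+    yaml.safe_dump(cfg, f)
+```
+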
-```sh +```bash bash ./scripts/dist_train_BERT.sh 12345 --config cfgs/Mixup_models/Point-BERT.yaml --exp_name pointBERT_pretrain --val_freq 2 ``` -## Reference +## References -[Point-BERT](https://github.com/lulutang0608/Point-BERT) +* [Point-BERT](https://github.com/lulutang0608/Point-BERT) -- Gitee From f6627f0e6e0489df2873a4459d4852ff454b9082 Mon Sep 17 00:00:00 2001 From: "mingjiang.li" Date: Mon, 10 Mar 2025 16:06:05 +0800 Subject: [PATCH 3/9] unify model readme format - cv/mot Signed-off-by: mingjiang.li --- cv/multi_object_tracking/README.md | 1 - .../bytetrack/paddlepaddle/README.md | 59 ++++----- .../deep_sort/pytorch/README.md | 39 +++--- .../fairmot/pytorch/README.md | 119 ++++++------------ 4 files changed, 91 insertions(+), 127 deletions(-) delete mode 100644 cv/multi_object_tracking/README.md diff --git a/cv/multi_object_tracking/README.md b/cv/multi_object_tracking/README.md deleted file mode 100644 index 366545845..000000000 --- a/cv/multi_object_tracking/README.md +++ /dev/null @@ -1 +0,0 @@ -# Object Tracking diff --git a/cv/multi_object_tracking/bytetrack/paddlepaddle/README.md b/cv/multi_object_tracking/bytetrack/paddlepaddle/README.md index 66132c1aa..4e7ef4662 100644 --- a/cv/multi_object_tracking/bytetrack/paddlepaddle/README.md +++ b/cv/multi_object_tracking/bytetrack/paddlepaddle/README.md @@ -1,30 +1,25 @@ # ByteTrack -## Model description +## Model Description -ByteTrack is a simple, fast and strong multi-object tracker. +ByteTrack is an efficient multi-object tracking (MOT) model that improves tracking accuracy by associating every +detection box, including low-score ones, rather than discarding them. It addresses challenges like occluded objects and +fragmented trajectories by leveraging similarities between detections and tracklets. ByteTrack achieves state-of-the-art +performance on benchmarks like MOT17, with high MOTA, IDF1, and HOTA scores while maintaining real-time processing +speeds. Its simple yet effective design makes it a robust solution for various object tracking applications in video +analysis. -Multi-object tracking (MOT) aims at estimating bounding boxes and identities of objects in videos. Most methods obtain identities by associating detection boxes whose scores are higher than a threshold. The objects with low detection scores, e.g. occluded objects, are simply thrown away, which brings non-negligible true object missing and fragmented trajectories. To solve this problem, we present a simple, effective and generic association method, tracking by associating every detection box instead of only the high score ones. For the low score detection boxes, we utilize their similarities with tracklets to recover true objects and filter out the background detections. When applied to 9 different state-of-the-art trackers, our method achieves consistent improvement on IDF1 scores ranging from 1 to 10 points. To put forwards the state-of-the-art performance of MOT, we design a simple and strong tracker, named ByteTrack. For the first time, we achieve 80.3 MOTA, 77.3 IDF1 and 63.1 HOTA on the test set of MOT17 with 30 FPS running speed on a single V100 GPU. 
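+
+To make the idea above concrete, here is a small, self-contained sketch of the two-stage association in plain Python.
+It is illustrative only and is not the tracker implemented in PaddleDetection; the greedy IoU matcher and the 0.6/0.3
+thresholds are assumptions chosen for readability.
+
+```python
+def iou(a, b):
+    """IoU of two [x1, y1, x2, y2] boxes."""
+    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
+    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
+    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
+    area_a = (a[2] - a[0]) * (a[3] - a[1])
+    area_b = (b[2] - b[0]) * (b[3] - b[1])
+    return inter / (area_a + area_b - inter + 1e-9)
+
+def greedy_match(tracks, dets, iou_thr):
+    """Greedily pair tracks with detections by descending IoU; return pairs and unmatched track indices."""
+    candidates = sorted(((iou(t, d), ti, di) for ti, t in enumerate(tracks) for di, d in enumerate(dets)), reverse=True)
+    pairs, used_t, used_d = [], set(), set()
+    for score, ti, di in candidates:
+        if score < iou_thr:
+            break
+        if ti in used_t or di in used_d:
+            continue
+        pairs.append((ti, di))
+        used_t.add(ti)
+        used_d.add(di)
+    return pairs, [ti for ti in range(len(tracks)) if ti not in used_t]
+
+def byte_associate(track_boxes, det_boxes, det_scores, high_thr=0.6, iou_thr=0.3):
+    """Stage 1: match tracks to high-score boxes; stage 2: recover leftover tracks with low-score boxes."""
+    high = [i for i, s in enumerate(det_scores) if s >= high_thr]
+    low = [i for i, s in enumerate(det_scores) if s < high_thr]
+    stage1, leftover = greedy_match(track_boxes, [det_boxes[i] for i in high], iou_thr)
+    matches = [(t, high[d]) for t, d in stage1]
+    stage2, _ = greedy_match([track_boxes[t] for t in leftover], [det_boxes[i] for i in low], iou_thr)
+    matches += [(leftover[t], low[d]) for t, d in stage2]
+    # Unmatched low-score boxes are dropped as background; unmatched high-score boxes would start new tracks.
+    return matches
+```
+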
+## Model Preparation

-## Step 1: Installation
+### Prepare Resources

-```bash
-git clone -b release/2.6 https://github.com/PaddlePaddle/PaddleDetection.git
-cd PaddleDetection
-pip3 install -r requirements.txt
-pip3 install protobuf==3.20.3
-pip3 install urllib3==1.26.6
-yum install mesa-libGL
-python3 setup.py develop
-```
-
-## Step 2: Preparing datasets
-
-Go to visit [MOT17 official website](https://motchallenge.net/), then download the MOT17 dataset, or you can download via [paddledet data](https://bj.bcebos.com/v1/paddledet/data/mot/MOT17.zip),then extract and place it in the dataset/mot/folder.
+Visit the [MOT17 official website](https://motchallenge.net/) and download the MOT17 dataset, or download it directly
+via [paddledet data](https://bj.bcebos.com/v1/paddledet/data/mot/MOT17.zip), then extract and place it in the
+`dataset/mot/` folder.

 The dataset path structure sholud look like:

-```
+```bash
 datasets/mot/MOT17/
 ├── annotations
 │   ├── train_half.json
@@ -39,10 +34,21 @@ datasets/mot/MOT17/

 ```

-## Step 3: Training
+### Install Dependencies

 ```bash
+git clone -b release/2.6 https://github.com/PaddlePaddle/PaddleDetection.git
+cd PaddleDetection
+pip3 install -r requirements.txt
+pip3 install protobuf==3.20.3
+pip3 install urllib3==1.26.6
+yum install mesa-libGL
+python3 setup.py develop
+```
+## Model Training
+
+```bash
 # One GPU
 CUDA_VISIBLE_DEVICES=0 python3 tools/train.py -c configs/mot/bytetrack/detector/ppyoloe_crn_l_36e_640x640_mot17half.yml --eval --amp

@@ -50,15 +56,12 @@ CUDA_VISIBLE_DEVICES=0 python3 tools/train.py -c configs/mot/bytetrack/detector/
 python3 -m paddle.distributed.launch --log_dir=ppyoloe --gpus 0,1,2,3,4,5,6,7 tools/train.py -c configs/mot/bytetrack/detector/ppyoloe_crn_l_36e_640x640_mot17half.yml --eval --amp
 ```

-## Results
+## Model Results

-| MODEL | mAP(0.5:0.95)|
-| ---------- | ------------ |
-| ByteTrack | 0.538 |
+| Model     | GPU        | FPS    | mAP(0.5:0.95) |
+|-----------|------------|--------|---------------|
+| ByteTrack | BI-V100 x8 | 4.6504 | 0.538         |

-| GPUS | FPS |
-| ---------- | ------ |
-| BI-V100x 8 | 4.6504 |
+## References

-## Reference
-- [PaddleDetection](https://github.com/PaddlePaddle/PaddleDetection) \ No newline at end of file
+- [PaddleDetection](https://github.com/PaddlePaddle/PaddleDetection)
diff --git a/cv/multi_object_tracking/deep_sort/pytorch/README.md b/cv/multi_object_tracking/deep_sort/pytorch/README.md
index 0a8517920..bf4debfe1 100644
--- a/cv/multi_object_tracking/deep_sort/pytorch/README.md
+++ b/cv/multi_object_tracking/deep_sort/pytorch/README.md
@@ -1,18 +1,16 @@
 # DeepSORT

-## Model description
+## Model Description

-This is an implement of MOT tracking algorithm deep sort. Deep sort is basicly the same with sort
-but added a CNN model to extract features in image of human part bounded by a detector. This CNN
-model is indeed a RE-ID model and the detector used in [PAPER](https://arxiv.org/abs/1703.07402) is
-FasterRCNN , and the original source code is [HERE](https://github.com/nwojke/deep_sort). However in
-original code, the CNN model is implemented with tensorflow, which I'm not familier with. SO I
-re-implemented the CNN feature extraction model with PyTorch, and changed the CNN model a little
-bit. Also, I use **YOLOv3** to generate bboxes instead of FasterRCNN.
+DeepSORT is an advanced multi-object tracking algorithm that extends SORT by incorporating deep learning-based
+appearance features. It combines motion information with a CNN-based RE-ID model to track objects more accurately,
+especially in complex scenarios with occlusions. DeepSORT uses a Kalman filter for motion prediction and associates
+detections using both motion and appearance cues. This approach improves tracking consistency and reduces identity
+switches, making it particularly effective for person tracking in crowded scenes and video surveillance applications.

-We just need to train the RE-ID model!
+## Model Preparation

-## Preparing datasets
+### Prepare Resources

 Download the [Market-1501](https://zheng-lab.cecs.anu.edu.au/Project/project_reid.html)

@@ -50,25 +48,26 @@ data
 ├── gallery
 ```

-## Training
+## Model Training

-The original model used in paper is in original_model.py, and its parameter here [original_ckpt.t7](https://drive.google.com/drive/folders/1xhG0kRH1EX5B9_Iz8gQJb7UNnn_riXi6).
+The original model used in the paper is in original_model.py, and its parameters are available here:
+[original_ckpt.t7](https://drive.google.com/drive/folders/1xhG0kRH1EX5B9_Iz8gQJb7UNnn_riXi6).

 ```sh
+# Train
 python3 train.py --data-dir your path
-```
-
-## Evaluate your model
-```sh
+
+# Evaluate your model.
 python3 test.py --data-dir your path
 python3 evaluate.py
 ```

-## Results
+## Model Results

-Acc top1:0.980
+| Model    | GPU        | Top1 ACC |
+|----------|------------|----------|
+| DeepSORT | BI-V100 x8 | 0.980    |

-## Reference
+## References

-Please refer to
+- [deep_sort_pytorch](https://github.com/ZQPei/deep_sort_pytorch)
diff --git a/cv/multi_object_tracking/fairmot/pytorch/README.md b/cv/multi_object_tracking/fairmot/pytorch/README.md
index f087b9587..e4167dd92 100644
--- a/cv/multi_object_tracking/fairmot/pytorch/README.md
+++ b/cv/multi_object_tracking/fairmot/pytorch/README.md
@@ -1,19 +1,18 @@
 # FairMOT

-## Model description
+## Model Description

-FairMOT is a model for multi-object tracking which consists of two homogeneous branches to predict pixel-wise objectness scores and re-ID features. The achieved fairness between the tasks is used to achieve high levels of detection and tracking accuracy. The detection branch is implemented in an anchor-free style which estimates object centers and sizes represented as position-aware measurement maps. Similarly, the re-ID branch estimates a re-ID feature for each pixel to characterize the object centered at the pixel. Note that the two branches are completely homogeneous which essentially differs from the previous methods which perform detection and re-ID in a cascaded style. It is also worth noting that FairMOT operates on high-resolution feature maps of strides four while the previous anchor-based methods operate on feature maps of stride 32. The elimination of anchors as well as the use of high-resolution feature maps better aligns re-ID features to object centers which significantly improves the tracking accuracy.
+FairMOT is an innovative multi-object tracking model that unifies detection and re-identification in a single framework.
+It features two homogeneous branches: one for anchor-free object detection and another for re-ID feature extraction.
+Operating on high-resolution feature maps, FairMOT achieves fairness between detection and re-ID tasks, resulting in
+improved tracking accuracy. Its joint learning approach eliminates the need for cascaded processing, making it more
+efficient and effective for complex tracking scenarios in crowded environments.
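+
+As a rough illustration of the two-branch design described above (not the training code in this repository), the toy
+module below puts an anchor-free detection head and a re-ID embedding head on one shared stride-4 feature map. The
+channel sizes and the 128-dimensional embedding are assumptions chosen for the example.
+
+```python
+import torch
+import torch.nn as nn
+
+class TwoBranchHead(nn.Module):
+    """Toy FairMOT-style head: detection and re-ID branches share one high-resolution feature map."""
+
+    def __init__(self, in_channels=64, num_classes=1, emb_dim=128):
+        super().__init__()
+        def branch(out_channels):
+            return nn.Sequential(
+                nn.Conv2d(in_channels, 256, 3, padding=1),
+                nn.ReLU(inplace=True),
+                nn.Conv2d(256, out_channels, 1),
+            )
+        self.heatmap = branch(num_classes)  # anchor-free object-center heatmap
+        self.size = branch(2)               # box width/height at each center
+        self.offset = branch(2)             # sub-pixel center offset
+        self.reid = branch(emb_dim)         # per-pixel re-ID embedding
+
+    def forward(self, feat):  # feat: stride-4 feature map from the backbone, e.g. [N, 64, H/4, W/4]
+        return {
+            "hm": torch.sigmoid(self.heatmap(feat)),
+            "wh": self.size(feat),
+            "reg": self.offset(feat),
+            "id": self.reid(feat),
+        }
+
+# Example: a 608x1088 input image corresponds to a 152x272 stride-4 map.
+outputs = TwoBranchHead()(torch.randn(1, 64, 152, 272))
+print({k: tuple(v.shape) for k, v in outputs.items()})
+```
+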
-## Step 1: Installing packages +## Model Preparation -```shell -$ pip3 install -r requirements.txt -$ pip3 install pandas progress -``` +### Prepare Resources -## Step 2: Preparing data - -### Download MOT17 dataset +Download MOT17 dataset - [Baidu NetDisk](https://pan.baidu.com/s/1lHa6UagcosRBz-_Y308GvQ) - [Google Drive](https://drive.google.com/file/d/1ET-6w12yHNo8DKevOVgK1dBlYs739e_3/view?usp=sharing) @@ -21,18 +20,18 @@ $ pip3 install pandas progress ```shell # Download MOT17 -$ mkdir -p data/MOT -$ cd data/MOT +mkdir -p data/MOT +cd data/MOT ``` ```shell -$ unzip -q MOT17.zip -$ mkdir MOT17/images && mkdir MOT17/labels_with_ids -$ mv ./MOT17/train ./MOT17/images/ && mv ./MOT17/test ./MOT17/images/ +unzip -q MOT17.zip +mkdir MOT17/images && mkdir MOT17/labels_with_ids +mv ./MOT17/train ./MOT17/images/ && mv ./MOT17/test ./MOT17/images/ -$ cd ../../ -$ python3 src/gen_labels_17.py +cd ../../ +python3 src/gen_labels_17.py ## The dataset path looks like below data/ @@ -45,7 +44,7 @@ data/    └── train ``` -### Download Pretrained models +Download Pretrained models - DLA-34 COCO pretrained model: [DLA-34 official](https://drive.google.com/file/d/18Q3fzzAsha_3Qid6mn4jcIFPeOGUaj1d) - HRNetV2-W18 ImageNet pretrained model: [BaiduYun(Access Code: r5xn)](https://pan.baidu.com/s/1Px_g1E2BLVRkKC5t-b-R5Q) @@ -53,91 +52,55 @@ data/ ```shell # Download ctdet_coco_dla_2x -$ mkdir -p models -$ cd models +mkdir -p models +cd models ``` -## Step 3: Training - -**The available train scripts are as follows:** +### Install Dependencies ```shell -train_dla34_mot17.sh -train_hrnet18_mot17.sh -train_hrnet32_mot17.sh - +pip3 install -r requirements.txt +pip3 install pandas progress ``` +## Model Training -### On single GPU +The available train scripts are as follows: ```shell -$ GPU_NUMS=1 bash