ModaNet: A Large-scale Street Fashion Dataset with Polygon Annotations

Searching for an ideal dress or pair of shoes can be challenging, especially when you do not know the right keywords to describe what you are looking for. Luckily, smart mobile devices provide an efficient and convenient way to capture the products that catch your eye in your photo album. The natural next step is to let an ecommerce app like eBay figure out the rest for you.

Understanding clothes and broader fashion products from such an image would have a significant commercial and cultural impact. Deploying this technology would empower not only buyers to find what they want, but also sellers, large and small, to sell faster with less hassle.

This technology requires excellence in several computer vision tasks: determining what the product in the image is (image classification), where it is (object detection, semantic image segmentation, instance segmentation), what it looks like (visual similarity), and how to describe the product and its image (image captioning). Recent work on convolutional neural networks (CNNs) has significantly improved the state of the art on these tasks. In image classification, a ResNeXt-101 model has achieved 85.4% top-1 accuracy1 on ImageNet-1K; in object detection, the best method2 has reached 52.5% mAP on the COCO 2017 benchmark for generic object detection; in semantic image segmentation, the top-performing method3 has reached 89% mIoU on the PASCAL VOC leaderboard for generic object segmentation.

Street fashion images pose unique challenges, including wide variation in appearance, style, brand, and layering of clothing items, so an open question is how well existing object detection and semantic image segmentation algorithms perform on them. Understanding the strengths and weaknesses of these algorithms on street fashion data is a prerequisite to bringing this technology to eBay customers.

Figure 1. Examples of annotations in the ModaNet dataset. These images contain pixel-level annotations for each product type.


Yamaguchi et al.4 created a street fashion dataset called Paperdoll, with a few hundred pixel-wise clothing annotations based on superpixels. We are introducing a new dataset called ModaNet, built on top of the Paperdoll dataset, that adds large-scale polygon-based fashion product annotations, as shown in Figure 1. Our dataset provides 55,176 street images fully annotated with polygons, on top of the 1 million weakly annotated street images in Paperdoll. ModaNet aims to provide a benchmark to fairly evaluate the progress of the latest data-hungry computer vision techniques for fashion understanding. The rich annotations allow detailed measurement of the performance of state-of-the-art algorithms for object detection, semantic segmentation, and polygon prediction on street fashion images.

The ModaNet dataset provides a large-scale street fashion image dataset with rich annotations, including polygonal/pixel-wise segmentation masks and bounding boxes. It consists of a training set of 52,377 images and a validation set of 2,799 images. The split ensures that each category in the validation set contains at least 500 instances, so that validation accuracy is reliable. The dataset contains 13 meta categories, where each meta category groups highly related categories to reduce ambiguity in the annotation process: bag, belt, boots, footwear, outer, dress, sunglasses, scarf and tie, pants, top, shorts, skirt, and headwear. All images were annotated by human annotators, who were trained for two weeks before starting and whose annotation accuracy reached 99.5%. During the annotation process, the annotators performed two tasks: (1) skip images that are ambiguous to annotate, and (2) draw polygon annotations around individual objects of interest in the image and assign each a label from the predefined set of 13 meta categories.
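Because the annotations are polygons, the other annotation types follow mechanically from them: a polygon's axis-aligned extent gives a detection bounding box, and the shoelace formula gives its pixel area (useful for filtering tiny instances). As a minimal sketch (the helper names here are illustrative, not part of any released ModaNet toolkit):

```python
def polygon_to_bbox(polygon):
    """Convert a polygon, a list of (x, y) vertices, into an
    axis-aligned bounding box (x_min, y_min, width, height)."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    x_min, y_min = min(xs), min(ys)
    return (x_min, y_min, max(xs) - x_min, max(ys) - y_min)

def polygon_area(polygon):
    """Polygon area in pixels via the shoelace formula."""
    n = len(polygon)
    twice_area = 0.0
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]  # wrap around to close the polygon
        twice_area += x1 * y2 - x2 * y1
    return abs(twice_area) / 2.0

# A 4x4 square annotation yields a 4x4 box and an area of 16 pixels.
square = [(1, 2), (5, 2), (5, 6), (1, 6)]
print(polygon_to_bbox(square))  # (1, 2, 4, 4)
print(polygon_area(square))     # 16.0
```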

The goal of object detection in ModaNet is to localize each fashion item in the image and assign it a category label, which can then drive visual search or product recommendation. We chose three of the most popular object detectors to evaluate on the ModaNet dataset: Faster R-CNN5, SSD6, and YOLO7. SSD and YOLO are single-stage, real-time detectors, while Faster R-CNN is the representative two-stage approach, which aims for more accurate results. Specifically, in our experiments, Faster R-CNN uses Inception-ResNet-v211 as its backbone network, while we chose Inception-V2 for SSD and the YOLO v2 network for the YOLO detector. Our experimental results suggest that more effort should be put into developing detectors that better handle the small and highly deformable objects common in fashion.
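Detection quality in such evaluations is typically scored by matching predicted boxes to ground truth with intersection-over-union (IoU); a prediction counts as correct when its IoU with a ground-truth box of the same category exceeds a threshold (0.5 is common). A self-contained sketch of the metric:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as
    (x_min, y_min, x_max, y_max) corner coordinates."""
    # Corners of the intersection rectangle
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Clamp to zero when the boxes do not overlap
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Two 2x2 boxes overlapping in a 1x1 region: IoU = 1 / (4 + 4 - 1)
print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 0.142857...
```

Small items such as belts and sunglasses occupy few pixels, so even modest localization error drives IoU below threshold, which is one reason small-object performance lags.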

Figure 2. Semantic image segmentation results on the ModaNet dataset. The first column contains output from DeepLabV3+. The last column contains ground truth annotations.


Semantic image segmentation provides more detailed localization information. We evaluated several representative approaches on the ModaNet dataset: Fully Convolutional Networks (FCNs)10, Conditional Random Fields as Recurrent Neural Networks (CRFasRNN)9, and DeepLabv3+. The FCN methods use a VGG network as the backbone; we adapted a VGG network with batch normalization, which obtains higher top-1 accuracy on the ImageNet-1K dataset12. We also added the CRFasRNN module on top of the FCNs and obtained more accurate results than FCNs alone. For DeepLabv3+, we took the publicly available TensorFlow8 implementation and the ImageNet pre-trained Xception-65 model, and fine-tuned it on ModaNet (see Figure 2). We find that DeepLabv3+ performs significantly better than the alternative approaches across all metrics, which shows the importance of the backbone network as well as careful design of the CNN modules for semantic image segmentation. We also find that CRFasRNN helps recover a better shape for objects like “outer” and “pants,” but performs worse on small objects such as “sunglasses,” as shown in Table 1.


Table 1. F-1 score per category of evaluated semantic segmentation approaches.
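The per-category F-1 scores in Table 1 can be computed by comparing predicted and ground-truth label maps pixel by pixel, counting true positives, false positives, and false negatives per class. A minimal sketch over flattened label maps (the helper name is illustrative):

```python
def per_class_f1(pred, gt, num_classes):
    """Per-class F1 = 2*TP / (2*TP + FP + FN) over flattened
    predicted and ground-truth label maps. Classes absent from
    both maps get None instead of a score."""
    tp = [0] * num_classes
    fp = [0] * num_classes
    fn = [0] * num_classes
    for p, g in zip(pred, gt):
        if p == g:
            tp[p] += 1
        else:
            fp[p] += 1  # predicted class p where it is not
            fn[g] += 1  # missed the true class g
    scores = []
    for c in range(num_classes):
        denom = 2 * tp[c] + fp[c] + fn[c]
        scores.append(2 * tp[c] / denom if denom else None)
    return scores

# Toy 4-pixel example with 2 classes: one pixel of class 1
# is mislabeled as class 0.
print(per_class_f1([0, 0, 1, 1], [0, 1, 1, 1], 2))  # [0.666..., 0.8]
```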


One immediate application of semantic image segmentation is predicting color attribute names from a street fashion photo. We developed a prototype based on the models trained on the ModaNet dataset: we first run semantic image segmentation, then predict color attribute names by mapping the mean RGB value of each segment to a fine-grained color namespace. This gives interesting results, as shown in Figure 3.
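The mapping step can be sketched as a nearest-neighbor lookup in RGB space. The tiny palette below is purely illustrative; the prototype described above uses a much finer-grained color namespace:

```python
# Hypothetical palette for illustration only; a real color namespace
# would contain many more, finer-grained entries.
PALETTE = {
    "red":   (220, 20, 60),
    "blue":  (30, 60, 200),
    "black": (20, 20, 20),
    "white": (240, 240, 240),
}

def mean_rgb(pixels):
    """Mean RGB over the pixels belonging to one predicted segment."""
    n = len(pixels)
    return tuple(sum(p[i] for p in pixels) / n for i in range(3))

def nearest_color_name(rgb, palette=PALETTE):
    """Map an RGB value to the closest palette entry by squared
    Euclidean distance in RGB space."""
    return min(palette,
               key=lambda name: sum((a - b) ** 2
                                    for a, b in zip(rgb, palette[name])))

# A dark reddish segment maps to "red".
print(nearest_color_name((200, 30, 50)))  # red
```

Distances in raw RGB are a rough proxy for perceived color; a production system might convert to a perceptually uniform space such as CIELAB before the nearest-neighbor lookup.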

Figure 3. Color attribute prediction using semantic image segmentation.



  1. Mahajan et al., Exploring the Limits of Weakly Supervised Pretraining. arXiv, 2018.
  2. Peng et al., MegDet: A Large Mini-Batch Object Detector. CVPR 2018.
  3. Chen et al., Rethinking Atrous Convolution for Semantic Image Segmentation. arXiv, 2017.
  4. Yamaguchi et al., Retrieving Similar Styles to Parse Clothing. IEEE TPAMI, 2014.
  5. Ren et al., Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. NIPS 2015.
  6. Liu et al., SSD: Single Shot MultiBox Detector. ECCV 2016.
  7. Redmon et al., You Only Look Once: Unified, Real-Time Object Detection. CVPR 2016.
  8. Abadi et al., TensorFlow: A System for Large-Scale Machine Learning. CoRR abs/1605.08695, 2016.
  9. Zheng et al., Conditional Random Fields as Recurrent Neural Networks. ICCV 2015.
  10. Long et al., Fully Convolutional Networks for Semantic Segmentation. CVPR 2015.
  11. Szegedy et al., Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning. AAAI 2017.
  12. Deng et al., ImageNet: A Large-Scale Hierarchical Image Database. CVPR 2009.