A significant bottleneck in building large-scale systems for image and video categorization is the requirement of labeled data. Manual labeling effort could be overcome by using the massive amount of web data. However, this type of data is collected through searching on the category names and is likely to inherit noise. In this study, (1) the primary objective is to improve utilizing weakly labeled data without any manual intervention. To this end, (2) we introduce a simple but effective method called "Iterative Model Evolution (I-ME)" where the goal is to discover representative instances by eliminating the irrelevant items so that the purified set can be directly used in training a model. In I-ME, (3) the elimination is done by leveraging the scores of two logistic regressors where the models are learned through iterations. We first apply our method for (4) recognizing complex human activities in images and videos and then a large-scale noisy web dataset, Clothing1M. (5) Our results are comparable to or better than the presented baselines on benchmark video datasets UCF-101, ActivityNet, FCVID and image dataset Action40. Through purifying with I-ME, we come up with only 40% of the noisy Clothing1M and we train the DNN with less but more representative training data without changing the network structure. (6) The success of I-ME over utilizing deep features supports that there is still room for improvement in exploiting large-scale weakly labeled data through mining to discover a smaller but more distinctive subset without increasing the complexity of the process.