My Profile

About Me

I am currently a researcher at Docta.ai. My work focuses on data-centric AI, large language models (LLMs), and advancing responsible, explainable, and trustworthy AI. I continue to explore weakly-supervised learning techniques (including handling label noise, semi-supervised, and self-supervised learning), fairness in machine learning, and federated learning. I am particularly interested in addressing the biases present in machine learning datasets and algorithms.

I received my Ph.D. from the University of California, Santa Cruz (UCSC) in 2023, where I was advised by Prof. Yang Liu. Prior to UCSC, I received my B.S. from the University of Electronic Science and Technology of China (UESTC), Chengdu, China, in 2016, under the supervision of Prof. Wenhui Xiong, and my M.S. (with honors) from ShanghaiTech University, Shanghai, China, under the supervision of Prof. Xiliang Luo.

[Publications] [Google Scholar]

News:


[Open-Source] Docta is available [here]! Did you know your data can get sick? How healthy is your data? Let Docta treat it!

[Data Cleaning API] Try our data-cleaning [API] for text data, including preference pairs, pairwise text scores, and individual text scores.

[2024.07] I served as an Area Chair for the KDD 2025 Research Track (August cycle).

[2023.08] I graduated from UCSC.

Recent Publications and Preprints:



See Google Scholar for the most recent publications.

[Open-Source] [Docta] Docta is available [here]! Did you know your data can get sick? How healthy is your data? Let Docta treat it! Docta is an advanced data-centric AI platform that offers a comprehensive range of services for detecting and rectifying issues in your data. For example, with Docta, you can automatically find annotation issues in LLM alignment data, e.g., the [Anthropic RLHF data] and the [MOSS public data].


[2023.05] [ICML 2023] We study how to evaluate fairness with weak proxy models when no ground-truth sensitive attributes are available. We first show that directly using weak proxy models can be misleading. We then study what kinds of weak proxy models are sufficient and how to best use them. We also show that using weak proxy models helps protect users' private information. This is joint work with Dr. Kevin Yuanshun Yao, Dr. Jiankai Sun, Dr. Hang Li, and Prof. Yang Liu. Check our paper and code here: [paper] [code].




[2023.01] [ICLR 2023] We study why and how self-supervised learning (SSL) features benefit learning with noisy labels. Specifically, we analyze when and why fixing the SSL encoder performs better than an unfixed encoder, and provide insights into how SSL features can still help in settings where fixing the encoder is not appropriate. We hope our observations can serve as guidance for further research on using SSL features to solve noisy-label problems. This is joint work with Hao Cheng, Dr. Xing Sun, and Prof. Yang Liu. Check our paper and code here: [paper] [code].



[2022.05] [ICML 2022] Learning with noisy labels has two outstanding caveats: 1) it requires customized training processes; 2) as long as the model is trained with noisy supervision, overfitting to corrupted patterns is often hard to avoid, leading to a performance drop in detection (a chicken-and-egg problem). We therefore propose SimiRep, a training-free method that detects corrupted labels without training a model to predict them. This is joint work with Zihao Dong (our summer intern) and Prof. Yang Liu. Check our paper and code here: [paper] [code].
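For intuition, here is a minimal sketch of the training-free idea, not the paper's exact algorithm: flag an example when its noisy label disagrees with the majority label among its nearest neighbors in a fixed (e.g., pre-trained) feature space. The function name and the default k are illustrative assumptions.

    # Sketch: training-free label-error detection by local voting.
    # An example is flagged when its noisy label disagrees with the
    # majority label of its k nearest neighbors in a fixed feature space.
    # Names and defaults are illustrative, not from the paper.
    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def detect_by_knn_vote(features, noisy_labels, k=10):
        """Return a boolean mask: True where a label looks corrupted."""
        nn = NearestNeighbors(n_neighbors=k + 1).fit(features)
        _, idx = nn.kneighbors(features)            # column 0 is the point itself
        neighbor_labels = noisy_labels[idx[:, 1:]]  # shape (n, k)
        flagged = np.empty(len(noisy_labels), dtype=bool)
        for i, labels in enumerate(neighbor_labels):
            flagged[i] = np.bincount(labels).argmax() != noisy_labels[i]
        return flagged

    # Usage: feats = encoder(images); mask = detect_by_knn_vote(feats, y_noisy)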



[2022.05] [ICML 2022] Existing estimators for noise transition matrices focus on computer vision tasks, where high-quality representations are relatively easy to obtain, and we empirically observe that these approaches fail on lower-quality features. To handle this issue, we propose a generally practical information-theoretic approach that down-weights the less informative parts of lower-quality features. This is joint work with Jialu Wang and Prof. Yang Liu. Check our paper and code here: [paper] [code].



[2022.01] [ICLR 2022] We study real-world human annotation errors in CIFAR-10 and CIFAR-100. The human-annotated noisy labels are available here. This is joint work with Jiaheng Wei, Hao Cheng, Prof. Tongliang Liu, Prof. Gang Niu, and Prof. Yang Liu. Check our paper and code here: [paper] [code] [dataset].
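For convenience, a short loading sketch; it assumes the dict layout of the public CIFAR-10N release (keys such as 'clean_label', 'aggre_label', and 'worse_label'), so please double-check against the dataset page.

    # Sketch: load the human-annotated noisy labels (assumed file layout).
    import torch

    labels = torch.load('CIFAR-10_human.pt', weights_only=False)
    clean = labels['clean_label']   # original CIFAR-10 labels
    aggre = labels['aggre_label']   # majority vote over three annotators
    worst = labels['worse_label']   # worst annotation per image
    print(f'noise rate (worst): {(clean != worst).mean():.2%}')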


[2022.01] [ICLR 2022] We reveal the disparate impacts of deploying semi-supervised learning (SSL): the "rich" sub-population (higher baseline accuracy without SSL) benefits more from SSL, while the "poor" sub-population (lower baseline accuracy) might even observe a performance drop after SSL. This is joint work with Tianyi Luo and Prof. Yang Liu. Check our paper and code here: [paper] [code].



[2021.09] [NeurIPS 2021] Our work "Policy Learning Using Weak Supervision" has been accepted to NeurIPS 2021! We propose a meta-framework that unifies reinforcement learning (RL) and behavior cloning (BC), and a novel algorithm called PeerPL for weakly-supervised policy learning. This is joint work with Jingkang Wang, Hongyi Guo, and Prof. Yang Liu. Check our paper and code here: [paper] [code].


[2021.08] [Best Paper] Our HOC paper won the Best Paper Award at the IJCAI 2021 Workshop on Weakly Supervised Representation Learning! [paper] [code] [demo]


[2021.05] [ICML 2021] We provide a new tool to estimate the noise transition matrix based on High-Order Consensuses (HOC) of noisy labels. We demonstrate that clusterability can serve as an alternative to anchor points; when good feature representations are available, our method is model-free (see the sketch below). This is joint work with Yiwen Song (our summer intern) and Prof. Yang Liu.
[paper] [code] [slides] [demo]
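To illustrate, a toy sketch of the consensus statistics HOC builds on: under clusterability, an example and its two nearest neighbors are assumed to share the same clean class, so the first-, second-, and third-order co-occurrence frequencies of their noisy labels carry information about the transition matrix. The solver that recovers the matrix from these counts is omitted, and the names below are mine.

    # Sketch: collect HOC-style consensus statistics from 2-NN label triples.
    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def consensus_stats(features, noisy_labels, num_classes):
        nn = NearestNeighbors(n_neighbors=3).fit(features)
        _, idx = nn.kneighbors(features)     # each row: self + 2 neighbors
        triples = noisy_labels[idx]          # shape (n, 3)
        n = len(triples)
        c1 = np.bincount(triples[:, 0], minlength=num_classes) / n
        c2 = np.zeros((num_classes, num_classes))
        c3 = np.zeros((num_classes,) * 3)
        for a, b, c in triples:
            c2[a, b] += 1                    # second-order co-occurrence
            c3[a, b, c] += 1                 # third-order co-occurrence
        return c1, c2 / n, c3 / n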



[2021.03] [CVPR 2021 (Oral)] One paper is accepted as an oral presentation at CVPR 2021! This paper studies how peer loss performs when facing human-level instance-dependent label noise; we use second-order information to cancel the effect of instance-dependent label noise. This is joint work with Prof. Tongliang Liu and Prof. Yang Liu. [paper] [code] [poster] [slides]
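For background, a minimal sketch of the first-order peer loss that this work builds on: the usual loss on a matched (input, label) pair minus the loss on a randomly re-paired one, which penalizes blindly fitting noisy labels. The second-order machinery of the CVPR paper is omitted, and `alpha` and the names are illustrative.

    # Sketch: first-order peer loss (the CVPR paper adds second-order info).
    import torch
    import torch.nn.functional as F

    def peer_loss(logits, noisy_labels, alpha=1.0):
        n = logits.size(0)
        perm_x = torch.randperm(n)   # independent index for inputs
        perm_y = torch.randperm(n)   # independent index for labels
        base = F.cross_entropy(logits, noisy_labels)
        peer = F.cross_entropy(logits[perm_x], noisy_labels[perm_y])
        return base - alpha * peer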



[2021.01] [ICLR 2021] We propose a COnfidence REgularized Sample Sieve (CORES²) to deal with instance-dependent label noise, with both theoretical guarantees (learning the clean distribution and sieving out corrupted examples) and experimental implementations (separating clean/corrupted examples, then applying semi-supervised learning techniques). This is joint work with Hao Cheng, Xingyu Li, Yifei Gong, Dr. Xing Sun, and Prof. Yang Liu. [paper] [code] [poster] [slides]
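A minimal sketch of the confidence-regularized loss behind the sieve, assuming a uniform prior over labels: cross entropy minus beta times the average cross entropy over all classes, with examples whose regularized loss exceeds a threshold sieved out as likely corrupted. The names and the threshold default are illustrative.

    # Sketch: confidence-regularized loss and a simple sieve rule.
    import torch
    import torch.nn.functional as F

    def cores_loss(logits, noisy_labels, beta=1.0):
        ce = F.cross_entropy(logits, noisy_labels, reduction='none')
        reg = -F.log_softmax(logits, dim=1).mean(dim=1)  # E_{Y~uniform}[CE]
        return ce - beta * reg                           # per-example loss

    def sieve_mask(logits, noisy_labels, beta=1.0, threshold=0.0):
        return cores_loss(logits, noisy_labels, beta) <= threshold  # True = keep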



[2020.12] [SIGMETRICS 2021] Our work "Federated Bandit: A Gossiping Approach" has been accepted to ACM SIGMETRICS 2021! The acceptance rate this year is only 12%. In our paper, we build an analytical framework to study a private and decentralized bandit setting ("Federated Bandit") and provide concentration bounds when heterogeneous rewards can only be shared via gossiping over a network. This is joint work with Jingxuan Zhu, Prof. Ji Liu, and Prof. Yang Liu.
[paper] [poster] [slides]
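To give a flavor of the gossip mechanism, a toy sketch: each agent keeps per-arm reward estimates and repeatedly averages them with its neighbors through a doubly stochastic mixing matrix W, so information spreads without a central server. The UCB exploration terms, privacy mechanism, and regret analysis from the paper are omitted; W and the setup are assumptions.

    # Sketch: gossip averaging of per-arm estimates over a 4-agent ring.
    import numpy as np

    def gossip_round(estimates, W):
        """estimates: (n_agents, n_arms); W: doubly stochastic mixing matrix."""
        return W @ estimates                 # each agent mixes neighbors' values

    W = np.array([[0.50, 0.25, 0.00, 0.25],
                  [0.25, 0.50, 0.25, 0.00],
                  [0.00, 0.25, 0.50, 0.25],
                  [0.25, 0.00, 0.25, 0.50]])
    est = np.random.rand(4, 3)               # local per-arm reward estimates
    for _ in range(50):
        est = gossip_round(est, W)
    print(est.std(axis=0))                   # near zero: agents reach consensus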