CVAT-BWV

— Parsa Hejabi, Akshay Kiran Padte, Preni Golazizian, Rajat Hebbar, Jackson Trager, Yiorgos Chochlakis, Aditya Kommineni, Ellie Graeden, Shrikanth Narayanan,
Benjamin A.T. Graham, Morteza Dehghani

A Web-Based Video Annotation Platform for Police Body-Worn Video

High-Quality Policing and Community Impact
- Essential government service, yet impacts communities unevenly [1]
- Influences public trust with over 20 million traffic stops annually [2]
Challenges with BWV Data
- Limited data analysis due to technical and resource challenges
- Often underutilized, primarily for crisis management
Potential of ML Tools
- Enables scalable analysis of BWV footage
- Promotes transparency and accountability
Studies on BWV Footage
- Revealed racial disparities in officer-civilian interactions [3,4,5]
- Underscore the importance of effective analysis
Requirement for Extensive and Diverse Human Annotation [6]

Photo by David McNew/Getty Images, 2017. Retrieved from here.

Photo by David Paul Morris/Bloomberg via Getty Images, June 2020. Retrieved from here.

Photo by Bruno Martins on Unsplash, n.d.

Photo by Scott Rodgerson on Unsplash, n.d.

Problems with Software as a Service platforms
- Cloud-based; raise data security concerns
- Closed-source; therefore, not customizable
Limited AI-assisted annotation features
Do not supporting all modalities

² https://labelbox.com

³ https://prodi.gy

ASR and Speaker Diarization Module

- Uses the WhisperX model [10]

Object Detection Module

- Automated bounding box creation for people and vehicles

- Integrates YOLOv8 model, a SOTA object detection model [11]

Face Recognition Module

- Prevents creation of duplicate bounding boxes for each individual

- Employs LightFace framework [12] with RetinaFace [13] and FaceNet [14] models

Using Human-in-the-Loop components can reduce the annotators' workload and enhance data quality [7,8,9]

Task Creation and Assignment

Transcript Annotation

Creation of Bounding Boxes and Audio Tags

Annotation Object Questionnaires

Linking Transcripts to Annotation Objects

[1] California Department of Justice. (2024). RIPA Board Report 2024. Retrieved from https://oag.ca.gov/system/files/media/ripa-board-report-2024.pdf
[2] Camp, N. P., Voigt, R., Jurafsky, D., & Eberhardt, J. L. (2021). The thin blue waveform: Racial disparities in officer prosody undermine institutional trust in the police. Journal of personality and social psychology, 121(6), 1157–1171. https://doi.org/10.1037/pspa0000270
[3] Voigt, R., Camp, N. P., Prabhakaran, V., Hamilton, W. L., Hetey, R. C., Griffiths, C. M., Jurgens, D., Jurafsky, D., & Eberhardt, J. L. (2017). Language from police body camera footage shows racial disparities in officer respect. In Proceedings of the National Academy of Sciences (Vol. 114, Issue 25, pp. 6521–6526). Proceedings of the National Academy of Sciences. https://doi.org/10.1073/pnas.1702413114
[4] Rho, E. H., Harrington, M., Zhong, Y., Pryzant, R., Camp, N. P., Jurafsky, D., & Eberhardt, J. L. (2023). Escalated police stops of Black men are linguistically and psychologically distinct in their earliest moments. In Proceedings of the National Academy of Sciences (Vol. 120, Issue 23). Proceedings of the National Academy of Sciences. https://doi.org/10.1073/pnas.2216162120

[5] Prabhakaran, V., Griffiths, C., Su, H., Verma, P., Morgan, N., Eberhardt, J. L., & Jurafsky, D. (2018). Detecting Institutional Dialog Acts in Police Traffic Stops. In Transactions of the Association for Computational Linguistics (Vol. 6, pp. 467–481). MIT Press - Journals. https://doi.org/10.1162/tacl_a_00031
[6] Graham, B., Brown, L., Chochlakis, G., Dehghani, M., Delerme, R., Friedman, B., Graeden, E., Golazizian, P., Hebbar, R., Hejabi, P., & others (2024). A Multi-Perspective Machine Learning Approach to Evaluate Police-Driver Interaction in Los Angeles. arXiv preprint arXiv:2402.01703.
[7] van der Wal, D., Jhun, I., Laklouk, I., Nirschl, J., Richer, L., Rojansky, R., Theparee, T., Wheeler, J., Sander, J., Feng, F., Mohamad, O., Savarese, S., Socher, R., & Esteva, A. (2021). Biological data annotation via a human-augmenting AI-based labeling system. In npj Digital Medicine (Vol. 4, Issue 1). Springer Science and Business Media LLC. https://doi.org/10.1038/s41746-021-00520-6

[8] Franco Marchesoni-Acland, & Gabriele Facciolo. (2023). IAdet: Simplest human-in-the-loop object detection.
[9] Weber, L., & Plank, B. (2023). ActiveAED: A Human in the Loop Improves Annotation Error Detection. In Findings of the Association for Computational Linguistics: ACL 2023 (pp. 8834–8845). Association for Computational Linguistics.
[10] Bain, M., Huh, J., Han, T., & Zisserman, A. (2023). WhisperX: Time-Accurate Speech Transcription of Long-Form Audio. In INTERSPEECH 2023. INTERSPEECH 2023. ISCA. https://doi.org/10.21437/interspeech.2023-78
[11] Jocher, G., Chaurasia, A., & Qiu, J.. (2023). Ultralytics YOLO.
[12] Serengil, S. I., & Ozpinar, A. (2020). LightFace: A Hybrid Deep Face Recognition Framework. In 2020 Innovations in Intelligent Systems and Applications Conference (ASYU). 2020 Innovations in Intelligent Systems and Applications Conference (ASYU). IEEE. https://doi.org/10.1109/asyu50717.2020.9259802

[13] Deng, J., Guo, J., Ververas, E., Kotsia, I., & Zafeiriou, S. (2020). RetinaFace: Single-Shot Multi-Level Face Localisation in the Wild. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 5202-5211).
[14] Schroff, F., Kalenichenko, D., & Philbin, J. (2015). Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 815-823).