A Web-Based Video Annotation Platform for Police Body-Worn Video
Photo by David McNew/Getty Images, 2017.
Photo by David Paul Morris/Bloomberg via Getty Images, June 2020.
Photo by Bruno Martins on Unsplash, n.d.
Photo by Scott Rodgerson on Unsplash, n.d.
² https://labelbox.com
³ https://prodi.gy
ASR and Speaker Diarization Module
- Uses the WhisperX model [10]
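
To illustrate how transcription and diarization output can be combined into speaker-labeled transcript lines, here is a minimal sketch in plain Python. It does not call WhisperX; the `Segment` and `SpeakerTurn` types and the `assign_speakers` helper are hypothetical names introduced for this example, and the maximum-overlap rule is an assumption about one reasonable way to merge the two outputs, not a description of the platform's implementation.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Segment:
    """A transcribed span of speech (hypothetical type; times in seconds)."""
    start: float
    end: float
    text: str
    speaker: Optional[str] = None


@dataclass
class SpeakerTurn:
    """A diarization span attributed to one speaker label (hypothetical type)."""
    start: float
    end: float
    speaker: str


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(segments: List[Segment], turns: List[SpeakerTurn]) -> List[Segment]:
    """Label each segment with the speaker whose turn overlaps it the most.

    Segments that overlap no diarization turn keep speaker=None so that an
    annotator can fill them in by hand.
    """
    labeled = []
    for seg in segments:
        best_speaker, best_overlap = None, 0.0
        for turn in turns:
            ov = overlap(seg.start, seg.end, turn.start, turn.end)
            if ov > best_overlap:
                best_speaker, best_overlap = turn.speaker, ov
        labeled.append(Segment(seg.start, seg.end, seg.text, best_speaker))
    return labeled
```

Leaving unmatched segments unlabeled, rather than guessing, keeps the human annotator in control of the uncertain cases.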
Object Detection Module
- Automated bounding box creation for people and vehicles
- Integrates the YOLOv8 model, a state-of-the-art object detection model [11]
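
The following sketch shows how raw detector output might be reduced to candidate bounding boxes for people and vehicles before an annotator reviews them. It is written in plain Python and does not call the detector itself; the `Detection` type, the `TARGET_CLASSES` set, the `propose_boxes` helper, and the 0.5 confidence cutoff are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumed set of labels treated as "people and vehicles" in this sketch.
TARGET_CLASSES = {"person", "car", "truck", "bus", "motorcycle", "bicycle"}


@dataclass
class Detection:
    """One detector output for a single frame (hypothetical type)."""
    label: str
    confidence: float
    box: Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def propose_boxes(detections: List[Detection],
                  min_confidence: float = 0.5) -> List[Detection]:
    """Keep person/vehicle detections above a confidence cutoff.

    The result is sorted from most to least confident so that the annotator
    sees the most reliable proposals first.
    """
    kept = [
        d for d in detections
        if d.label in TARGET_CLASSES and d.confidence >= min_confidence
    ]
    return sorted(kept, key=lambda d: d.confidence, reverse=True)
```

The cutoff trades annotator effort in two directions: a lower value produces more boxes to reject, a higher value produces more boxes to draw by hand.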
Face Recognition Module
- Prevents creation of duplicate bounding boxes for each individual
- Employs the LightFace framework [12] with the RetinaFace [13] and FaceNet [14] models
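
Duplicate prevention can be pictured as comparing a new face embedding against the embeddings of individuals who already have a bounding box. The sketch below is plain Python and does not call the face recognition framework; `cosine_distance`, `match_identity`, the identity names, and the 0.4 distance threshold are hypothetical choices for illustration, not values taken from the platform.

```python
import math
from typing import Dict, List, Optional


def cosine_distance(a: List[float], b: List[float]) -> float:
    """Return 1 - cosine similarity; 0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # treat a zero vector as matching nothing
    return 1.0 - dot / (norm_a * norm_b)


def match_identity(embedding: List[float],
                   known: Dict[str, List[float]],
                   threshold: float = 0.4) -> Optional[str]:
    """Return the closest known identity within the threshold, else None.

    A None result means the face is treated as a new individual, so a new
    bounding box would be created instead of reusing an existing one.
    """
    best_id, best_dist = None, threshold
    for identity, reference in known.items():
        dist = cosine_distance(embedding, reference)
        if dist < best_dist:
            best_id, best_dist = identity, dist
    return best_id
```

In practice the threshold would need tuning for the embedding model and footage quality, since body-worn video often contains motion blur and partial occlusion.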
Human-in-the-loop components can reduce annotators' workload and improve data quality [7,8,9].
Task Creation and Assignment
Transcript Annotation
Creation of Bounding Boxes and Audio Tags
Annotation Object Questionnaires
Linking Transcripts to Annotation Objects
[1] California Department of Justice. (2024). RIPA Board Report 2024. Retrieved from https://oag.ca.gov/system/files/media/ripa-board-report-2024.pdf
[2] Camp, N. P., Voigt, R., Jurafsky, D., & Eberhardt, J. L. (2021). The thin blue waveform: Racial disparities in officer prosody undermine institutional trust in the police. Journal of Personality and Social Psychology, 121(6), 1157–1171. https://doi.org/10.1037/pspa0000270
[3] Voigt, R., Camp, N. P., Prabhakaran, V., Hamilton, W. L., Hetey, R. C., Griffiths, C. M., Jurgens, D., Jurafsky, D., & Eberhardt, J. L. (2017). Language from police body camera footage shows racial disparities in officer respect. Proceedings of the National Academy of Sciences, 114(25), 6521–6526. https://doi.org/10.1073/pnas.1702413114
[4] Rho, E. H., Harrington, M., Zhong, Y., Pryzant, R., Camp, N. P., Jurafsky, D., & Eberhardt, J. L. (2023). Escalated police stops of Black men are linguistically and psychologically distinct in their earliest moments. Proceedings of the National Academy of Sciences, 120(23). https://doi.org/10.1073/pnas.2216162120
[5] Prabhakaran, V., Griffiths, C., Su, H., Verma, P., Morgan, N., Eberhardt, J. L., & Jurafsky, D. (2018). Detecting Institutional Dialog Acts in Police Traffic Stops. Transactions of the Association for Computational Linguistics, 6, 467–481. https://doi.org/10.1162/tacl_a_00031
[6] Graham, B., Brown, L., Chochlakis, G., Dehghani, M., Delerme, R., Friedman, B., Graeden, E., Golazizian, P., Hebbar, R., Hejabi, P., et al. (2024). A Multi-Perspective Machine Learning Approach to Evaluate Police-Driver Interaction in Los Angeles. arXiv preprint arXiv:2402.01703.
[7] van der Wal, D., Jhun, I., Laklouk, I., Nirschl, J., Richer, L., Rojansky, R., Theparee, T., Wheeler, J., Sander, J., Feng, F., Mohamad, O., Savarese, S., Socher, R., & Esteva, A. (2021). Biological data annotation via a human-augmenting AI-based labeling system. npj Digital Medicine, 4(1). https://doi.org/10.1038/s41746-021-00520-6
[8] Marchesoni-Acland, F., & Facciolo, G. (2023). IAdet: Simplest human-in-the-loop object detection.
[9] Weber, L., & Plank, B. (2023). ActiveAED: A Human in the Loop Improves Annotation Error Detection. In Findings of the Association for Computational Linguistics: ACL 2023 (pp. 8834–8845). Association for Computational Linguistics.
[10] Bain, M., Huh, J., Han, T., & Zisserman, A. (2023). WhisperX: Time-Accurate Speech Transcription of Long-Form Audio. In INTERSPEECH 2023. ISCA. https://doi.org/10.21437/interspeech.2023-78
[11] Jocher, G., Chaurasia, A., & Qiu, J. (2023). Ultralytics YOLO.
[12] Serengil, S. I., & Ozpinar, A. (2020). LightFace: A Hybrid Deep Face Recognition Framework. In 2020 Innovations in Intelligent Systems and Applications Conference (ASYU). IEEE. https://doi.org/10.1109/asyu50717.2020.9259802
[13] Deng, J., Guo, J., Ververas, E., Kotsia, I., & Zafeiriou, S. (2020). RetinaFace: Single-Shot Multi-Level Face Localisation in the Wild. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 5202–5211).
[14] Schroff, F., Kalenichenko, D., & Philbin, J. (2015). FaceNet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 815–823).