LLMs: Prompt Engineering and Retrieval Augmented Generation for Digital Forensics

Authors: Hans Henseler, University of Applied Sciences Leiden and Netherlands Forensic Institute, Netherlands; Kwok-Yan Lam, Nanyang Technological University, Singapore; Victor C.W. Cheng, TAU Express Pte Ltd, Singapore; and Zee Kin Yeong, Singapore Academy of Law, Singapore.

DFRWS APAC 2023

Abstract

In this workshop we will provide an introduction to large language models (LLMs), and ChatGPT in particular. The introduction describes the rise of artificial intelligence during the last decade and explains key concepts in deep learning that are at the core of LLMs. It will be followed by three hands-on parts. In the first part, participants will gain experience with prompt engineering in ChatGPT for digital forensics. The exercises cover basic principles; zero-shot, few-shot and chain-of-thought prompting; summarization; brainstorming; data transformation; and programming. In the second part, participants will work with a code notebook to get firsthand experience running an open source LLM in Python and to compare its performance against ChatGPT. In the third and final part, participants will be introduced to retrieval augmented generation using a code notebook featuring an open source LLM, an embedding database and a sentence transformer. This will enable participants to index and ask questions about a legal document (e.g. a criminal case court ruling or a digital forensic expert witness report) that is new to the LLM, i.e., it was not part of the data set used to train the LLM.
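The retrieval step of the third part can be illustrated with a minimal sketch. It is not the workshop notebook: a word-count vector stands in for the sentence transformer, a Python list stands in for the embedding database, and the prompt is only printed rather than sent to an LLM. All function names and the sample report sentences are hypothetical and invented for illustration.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts (stand-in for a sentence transformer)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def build_index(chunks):
    """Store each chunk with its vector (stand-in for an embedding database)."""
    return [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(index, question, k=2):
    """Return the k chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(question, passages):
    """Put the retrieved passages in front of the question for the LLM."""
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer the question using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Hypothetical chunks of an expert witness report (invented for illustration).
chunks = [
    "The seized laptop contained a deleted chat database.",
    "A physical extraction of the phone took place on site.",
    "The expert witness testified on the second day of trial.",
]
index = build_index(chunks)
question = "Which deleted database was on the laptop?"
prompt = build_prompt(question, retrieve(index, question, k=1))
print(prompt)
```

Because the passage is supplied in the prompt, the model can answer questions about a document it never saw during training. In a real pipeline the word counts would be replaced by dense sentence embeddings, which match on meaning rather than on shared words.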

Requirements:
Participants are expected to bring a laptop with internet connectivity and to have registered with OpenAI for (free) access to ChatGPT (https://chat.openai.com/). Participants should also have a Gmail account for access to Google Colab (https://colab.google/); alternatively, those with a powerful laptop can use their own Python IDE (e.g. Visual Studio Code: https://visualstudio.microsoft.com/downloads/) to run the notebooks locally.

GitHub repo with hands-on assignments (link opens in Google Colab): https://colab.research.google.com/github/HansHenseler/DFRWS-APAC-LLM-Workshop

Bios:

Hans Henseler is a senior digital forensics scientist at the Netherlands Forensic Institute and a part-time professor of Digital Forensics & E-Discovery at the University of Applied Sciences Leiden. He is chairman of the Board of Directors of DFRWS, and until recently he was a member of the board of the Netherlands Register of Court Experts (NRGD). Hans received his M.Sc. in Computer Science from Delft University of Technology and holds a Ph.D. on Artificial Neural Networks from the University of Maastricht. In 1992 he founded the department of forensic computer science at the Netherlands Forensic Institute (NFI), and in 1996 he was vice chairman of the International Organisation on Computer Evidence (IOCE). From 2000 to 2006 he was CTO at ZyLAB and responsible for ZyLAB’s E-Discovery software suite devops, support and services. From 2006 to 2010 Hans was director of Forensic Technology Solutions at PwC and led the PwC Eurofirms Forensic Technology Solutions practice. He managed large-scale multi-national E-Discovery projects including data collection, email review and financial data analysis. From 2010 to 2014 he was partner and senior E-Discovery consultant at Fox-IT. Besides E-Discovery projects, he also led the development of Tracks Inspector. Tracks Inspector separated from Fox-IT after a management buy-out in 2014; Hans was its co-founder and CEO until the acquisition by Magnet Forensics in 2018, where he was Director of Digital Evidence Review until March 2020.

Kwok-Yan Lam (Senior Member, IEEE) received his B.Sc. degree (1st Class Hons.) from the University of London in 1987 and his Ph.D. degree from the University of Cambridge in 1990. He is the Associate Vice President (Strategy and Partnerships) in the President’s Office, and Professor in the School of Computer Science and Engineering, at the Nanyang Technological University, Singapore. He is currently also the Executive Director of the National Centre for Research in Digital Trust, and Director of the Strategic Centre for Research in Privacy-Preserving Technologies and Systems (SCRiPTS). Since August 2020, he has been on part-time secondment to INTERPOL as a Consultant at Cyber and New Technology Innovation. Prior to joining NTU, he was a Professor at Tsinghua University, PR China (2002–2010) and a faculty member of the National University of Singapore and the University of London since 1990. He was a Visiting Scientist at the Isaac Newton Institute, Cambridge University, and a Visiting Professor at the European Institute for Systems Security. In 1998, he received the Singapore Foundation Award from the Japanese Chamber of Commerce and Industry in recognition of his research and development achievements in information security in Singapore. He is the recipient of the Singapore Cybersecurity Hall of Fame Award in 2022. His research interests include Distributed Systems, Intelligent Systems, IoT Security, Distributed Protocols for Blockchain, Homeland Security and Cybersecurity.

Zee Kin Yeong is presently the Chief Executive of the Singapore Academy of Law. Prior to this, he was the Assistant Chief Executive (Data Innovation and Protection Group) of the Infocomm Media Development Authority of Singapore (IMDA) and, concurrently, the Deputy Commissioner of the Personal Data Protection Commission (PDPC). During his time with the IMDA and PDPC, Zee Kin spearheaded the development of Singapore’s Model AI Governance Framework, which won the UN ITU WSIS Prize in 2019. Between 2018 and 2020, he also oversaw the review of and amendments to the Personal Data Protection Act 2012. Zee Kin is an internationally recognized expert on AI ethics. He is currently a member of the OECD Network of Experts on AI (ONE AI). In 2019, Mr Yeong was a member of the AI Group of Experts at the OECD (AIGO), which developed the OECD Principles on AI. These principles were endorsed by the G20 in 2019. Zee Kin is also a well-regarded expert on data privacy issues. He has contributed to publications on legal issues relating to data privacy, and has spoken at many well-recognised international and domestic platforms on this topic. He is currently participating as an expert in the Global Partnership on AI (GPAI)’s Data Governance Working Group, which addresses data protection issues at the intersection of AI development and deployment. In addition, Zee Kin has spoken and published in areas relating to electronic evidence and intellectual property, as well as legal issues relating to Blockchain and AI deployment. Formerly a partner in Rajah & Tann LLP’s iTec (intellectual property, technology, entertainment and communications) practice group, his experience as a Technology, Media and Telecommunications lawyer spans both the public and private sectors.

Victor C.W. Cheng is a data scientist at TAU Express Pte Ltd, where he performs AI research and recommends and implements AI solutions, mainly in the areas of unstructured text and natural language processing. He received his M.Phil. from the Hong Kong Polytechnic University and his Ph.D. from the Hong Kong Baptist University. He has over 15 years of solid experience in Natural Language Processing (NLP) and has taken part in over 20 NLP projects, including ICRS and EssayCritic. He is familiar with both statistical methods and deep learning for text processing, semantic analysis, semantic parsing, sentiment analysis and text mining. He is also experienced in conversational agents, LLMs, speech recognition and speech synthesis. Apart from NLP, his research interests include machine learning, topic modeling, particle filters and Wi-Fi positioning.

Downloads

LLMs: Prompt Engineering and Retrieval Augmented Generation for Digital Forensics (Slides)