Presentation on the legal use of large language models


{ dr. Homoki Péter / 2023.06.15 }

This is a short transcript of my presentation given in Warsaw at the Modern Bar Association (2023) New Legaltech Challenges conference. You can find the presentation slides here. These are my notes for the presentation, with slide numbers in square brackets [ ].

[1] I am a lawyer, past chairman of CCBE’s IT law committee, and a member of ELTA.

[2] For the next 25 minutes, I will be talking about how large language models will affect lawyers’ everyday lives. I’ll start with a short technical explanation, then discuss the practical uses of LLM-based tools, including some limitations on their use.

While existing models are pretty affordable for both large and small firms to use, there are already considerable differences in how these tools can be used, for example in different languages. These technical differences will leave their mark on how legal markets develop.

I will not be able to give an in-depth discussion here, so I also suggest everyone read the Guide on the use of AI tools for lawyers and law firms, published last year by CCBE/ELF.

[3] This slide is a high-level illustration of the currently most popular LLM, GPT by OpenAI, a type of “decoder-only Transformer” model. The illustration is based on GPT-3 only, because that is the last version for which we have reliable information on some of the parameters. I have to highlight that this is only one specific type of LLM; however, when viewed from as wide an angle as this, most language models work similarly.

They represent probabilities for text sequences: given the earlier parts of the sequence, what will the next part be?

They all work with numbers, so our first task is to transform the words into numbers in a way that retains as much as possible of the semantic and contextual meaning of the words used.
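As a minimal illustration of this text-to-number step (my own sketch, not from the slides): OpenAI’s open-source tiktoken library exposes the tokenizer used by GPT-4, showing how a sentence becomes a sequence of integer token IDs, which the model then maps to vectors internally.

```python
# A minimal sketch of the text-to-number step, using OpenAI's
# open-source "tiktoken" tokenizer (pip install tiktoken).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # the encoding used by GPT-4

tokens = enc.encode("The contract shall terminate on 31 December.")
print(tokens)              # a list of integer token IDs
print(enc.decode(tokens))  # decodes back to the original text
```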

This transformation is a very important part of the trick. These input tokens are then guided through a number of blocks (96 in GPT-3!), each of which is itself an artificial neural network.

And out of these blocks, we receive “contextualised tokens”, that is, a more precise representation of what each token means in the full context.

From these contextualised tokens, the first response token is generated based on probability values, and then each subsequent response token is generated taking both the earlier response tokens and the original input tokens into account, until the LLM generates a stop token (e.g. because the required answer was very short) or generation ends for some other reason.

These response tokens are turned back into text.
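To make this loop concrete, here is a minimal sketch (again my own, not from the slides) using GPT-2, a small, freely downloadable decoder-only model, via the Hugging Face transformers library. It picks the single most probable next token at each step; real products usually sample from the probability distribution instead.

```python
# A minimal sketch of the generation loop, using GPT-2 (a small,
# freely available decoder-only model) via Hugging Face transformers
# (pip install transformers torch).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tok("A contract is", return_tensors="pt").input_ids
for _ in range(20):                    # generate at most 20 tokens
    logits = model(ids).logits[0, -1]  # scores for every possible next token
    next_id = logits.argmax()          # greedy choice: the most probable token
    if next_id.item() == tok.eos_token_id:
        break                          # a stop token ends the answer
    ids = torch.cat([ids, next_id.view(1, 1)], dim=1)

print(tok.decode(ids[0]))
```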

The probability values in the neural networks are learned by first training the networks on very large amounts of text, in what is called self-supervised training.

So LLMs are first trained during a training phase, and based on this training, they can then provide predictions (inference).

[4] This slide just shows that the family of LLMs is very diverse. We are currently living through a veritable “Cambrian explosion” of LLMs.

The most popular branch is the grey one, that of the decoder-only transformer-based language models, but that doesn’t mean that lawyers will be using only this type of LLM, or that these are the best tools for all kinds of legal jobs.

(Source of the picture: Yang et al.: Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond)

[5] Researchers identified around 2018 that by increasing the number of trained parameters (vocabulary, transformer layers, size of the feed-forward neural network, etc.), LLMs attain new, surprising abilities.

Emergent ability means that a small quantitative change causes qualitative changes. Such changes are unexpected, and we have no good explanation of why they happen; they just happen.

A number of such emergent abilities have already become mainstream in the last few years:

One is being able to use the input text given to generative LLMs (the prompt) as instructions for a wide range of tasks, instead of having to fine-tune an LLM for each specific task (such as classification). Fine-tuning has its own costs and is time-consuming, so prompts provide a desirable versatility in using these models.

A similar emergent ability of one of the largest, newest GPT models was that its logical reasoning capabilities became surprisingly better compared to the previous version.

(These emergent abilities are the reason the most popular LLMs are generative/decoder-only LLMs: it became possible to use decoder-only LLMs for tasks that were previously encoder-only. But that doesn’t mean all LLMs have to be generative.)

A slightly overlapping but different area is that fine-tuning LLMs to respond better to instructions also made them able to complete tasks more generally, even in other languages and in new areas. Finally, a major change was that, thanks to the ChatGPT product, the capabilities of these LLMs became well known to the public. Even though such capabilities were already visible to a wider group of developers and the ML community in late 2021, these changes made these products enjoyable for almost everyone.

[6] There are lots of opportunities for us in these LLMs. Many ML-based solutions have become more accessible to non-specialists, like lawyers. Lawyers will be able to use such tools much more easily, and will have a better chance of understanding how these tools work and where they can help.

It’s not that lawyers themselves will be programming to make use of these tools, but that cheaper developers or even consultants will be able to help lawyers use them. Many expensive, complex solutions can be replaced by cheaper functionality provided by LLMs. We may also need to integrate fewer of these complex tools to achieve similar results.

I want to show you through some examples how easy it ALREADY is to do exciting things with very cheap API calls to powerful LLMs and some local programming, not just in theory but in practice. Most of my examples use GPT-4 from OpenAI, because that is the only versatile LLM that is usable in Hungarian in practice. Apart from this involuntary OpenAI endorsement, I try NOT to show specific commercial products, but I have to mention that there are already some great products available in some fields.

[7] This is just an illustration that “prompt-based” generative tools can be used for very different purposes. In the upper left corner, you see the instruction for the task, more generally called the “prompt”, and the results: it puts a label on each element of the list provided to the LLM. The classification task used in this example was originally considered (in natural language processing) not a generative task but a discriminative one. And still, generative tools became able to solve this type of task, thanks to the versatility of prompts. In the lower right corner, you see an instruction for an information extraction task and the text from which to extract the keywords.
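As a hypothetical re-creation of what such a classification prompt looks like in code (the labels and list items below are my own invention, not those on the slide), using the OpenAI Python library as it worked at the time of this talk:

```python
# Hypothetical re-creation of a prompt-based classification task with
# the OpenAI Python library (the 0.x interface current at the time of
# this talk); the labels and list items are invented for illustration.
import openai

openai.api_key = "sk-..."  # your own API key

prompt = """Label each item as CONTRACT, JUDGMENT or STATUTE.

1. Share purchase agreement between A Kft. and B Zrt.
2. Act V of 2013 on the Civil Code
3. Curia decision Pfv.20.123/2022"""

resp = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
    temperature=0,  # keep the output as deterministic as possible
)
print(resp["choices"][0]["message"]["content"])
```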

[8] Here, we can see that LLMs are not a replacement for structured document automation systems, but a very useful addition to them… Contract automation is the oldest part of legal automation, and LLMs provide new capabilities to existing products.

The example shows the versatility of GPT-4 in reliably adjusting more generic contract terms to the specific desired wording, in a very user-friendly way, in Hungarian, a language with very complex grammar. With a rules-based approach, this is very difficult to do, and the result is not nearly as reliable.
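A rough sketch of what such a clause-adjustment call can look like; the clause, party name and instruction are invented for illustration (the slide’s actual example was in Hungarian):

```python
# Hypothetical sketch of clause adjustment inside a document
# automation flow; the clause, names and instruction are invented.
import openai

openai.api_key = "sk-..."

generic_clause = ("The Seller warrants that the Goods are free "
                  "from any third-party rights.")
instruction = (
    "Rewrite this clause so that the seller is 'Alfa Kft.' and the "
    "goods are 'the leased vehicles', keeping the legal meaning "
    "otherwise unchanged:\n\n" + generic_clause
)

resp = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[{"role": "user", "content": instruction}],
    temperature=0,
)
print(resp["choices"][0]["message"]["content"])
```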

But LLMs can help with contract automation at levels higher than a specific clause as well, for example at the level of the structure of the contract. They help capture relevant information from the contract structure that can be used both for review and for assembly purposes. But more research is needed in this area.

[9] Another example uses LLMs for information retrieval in a “question answering” approach. Here I show you a screenshot of an experiment in Hungarian.

This is called “open book” question answering; it was based on the text of the Hungarian Civil Code and the Hungarian Civil Procedure Code only.

Of course, you can also ask generic questions about Hungarian law in ChatGPT as a chatbot, and the responses will sound convincing to non-lawyers. But they are usually completely wrong. I tried to measure whether this approach gives more precise, useful answers: here, we use the GPT model not as a conversation tool, but as a component in a QA system that builds on semantic search results, also called “information retrieval” results.

These legal codes are too big to fit into a single prompt of a GPT model; the maximum context length does not make that possible.

So we first split the codes into chunks, each no larger than the maximum context window, convert them into embeddings, and find those chunks of text that are closest in meaning to the question. We feed only these relevant chunks into the LLM.

The LLM then has to answer our legal question based on the chunks provided.
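Put together, the pipeline can be sketched roughly like this (a simplified, hypothetical version: the chunk texts are placeholders, and the embedding model name is the one OpenAI offered at the time):

```python
# Simplified, hypothetical sketch of the "open book" pipeline:
# embed the chunks, retrieve the closest one by cosine similarity,
# then let the model answer from that chunk only (pip install
# openai numpy; 0.x OpenAI interface). The chunk texts below are
# placeholders, not the real statutes.
import numpy as np
import openai

openai.api_key = "sk-..."

def embed(text):
    resp = openai.Embedding.create(model="text-embedding-ada-002",
                                   input=text)
    return np.array(resp["data"][0]["embedding"])

# In reality, these come from splitting the codes into chunks that
# fit the embedding model's context window.
chunks = ["Section 6:63. The contract is concluded when ...",
          "Section 6:213. Termination of the contract ..."]
chunk_vecs = [embed(c) for c in chunks]

question = "How is a contract concluded?"
q = embed(question)

# Cosine similarity; for simplicity we keep only the closest chunk.
sims = [float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v)))
        for v in chunk_vecs]
best = chunks[int(np.argmax(sims))]

resp = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[{"role": "user",
               "content": f"Answer based only on this excerpt:\n"
                          f"{best}\n\nQuestion: {question}"}],
)
print(resp["choices"][0]["message"]["content"])
```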

My experiment tried to see for what kinds of questions we can expect a decent response from GPT-4, and which types of questions are difficult.

I made a very small set of model answers and verified that, with this approach, GPT-4 was able to answer 75% of the questions correctly, while with the chatbot version (using pretrained data only for answers), the correct response rate was only 33%.

I could also identify the more difficult question types, where multiple steps and reasoning were needed, and compared the results with some different LLMs and parameters…

[10] This open-book experiment has also shown why lawyers need to have their own benchmarks for question answering and similar tasks. Only such benchmarks can help us compare the capabilities of different technical approaches and different models. Generic NLP benchmarks are not useful for this domain-specific measurement.

There are already a number of such benchmarks for legal uses, but separate tasks, jurisdictions and languages need separate benchmarks. Of course, it would make sense to harmonise such benchmarks across EU member states.

(See also, e.g.: Evaluation of proposed models on real-world “datasets” in Yang et al.: Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond)
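A sketch of what the skeleton of such a benchmark might look like (everything here is illustrative: `ask_llm` stands in for any of the QA set-ups above, and real grading would need lawyers or a far more careful comparison than this naive keyword check):

```python
# A tiny, illustrative benchmark harness: model answers are checked
# against lawyer-written reference keywords. "ask_llm" stands in for
# any of the QA set-ups above; real grading needs human reviewers or
# a far more careful comparison than this naive keyword check.
def ask_llm(question: str) -> str:
    raise NotImplementedError  # plug in the open-book QA pipeline here

benchmark = [
    # (question, keywords a correct answer must contain) - invented
    ("How is a contract concluded?", ["mutual", "intent"]),
    ("When may a party terminate for breach?", ["breach", "notice"]),
]

correct = 0
for question, keywords in benchmark:
    answer = ask_llm(question).lower()
    if all(k in answer for k in keywords):
        correct += 1

print(f"accuracy: {correct / len(benchmark):.0%}")
```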

[11] We have to be aware of the limitations of these LLMs. The most important limitation is that most of them are not usable in languages other than English. Even if a model is capable of speaking a given language, it may be limited to answering in English only.

So a big question is: can national governments do something about this? Do they have the will and the necessary resources to approximate the usefulness of the best commercial large language models? For example, the Hungarian PULI GPT-3SX model became available in December 2022; with 6.7B parameters it is very capable, but it is still far from being in the same league as a large GPT-3 model. And even this cost a lot to train.

We also have to be mindful of other technical limitations, e.g. even APIs do not work the same way from Hungary as from the US…

[12] Continuing with the risks of using LLMs: even the best models constantly produce bad answers. Merely using an LLM will not solve the problem of the reliability of the answers.

Sometimes, the error is due to not choosing the appropriate model, e.g. using a generative model for purposes where a different model would be a lot better. It’s not as simple as using one versatile LLM for all possible problems. For contract automation purposes, for example, LLMs work only as a tiny component in a complex piece of software.

Or, in due diligence tasks, a lot more effort may need to be put into identifying the diverse descriptions of relevant provisions; it’s not enough just to pose the question to GPT-4 in a prompt.

We have to be cautious, because many legaltech providers try to sell us unreliable and untried solutions simply by involving GPT-4, and then tell us that we should ensure the involvement of a lawyer to review the output.

This is “human-in-the-loop” used as a liability shield, similar to how “Full Self-Driving” is used by Tesla in a way that contradicts its own manual.

What is the real significance of this human in the loop, anyway?

[13] Human-in-the-loop is important now, because there are very few mature products that use LLMs such as GPT-4.

There was simply not enough time for wide-ranging testing. As long as we are still in this research phase, we have to focus on the use of LLMs in solutions assisting lawyers, not solutions serving clients, who are not able to judge the quality of the output.

But a number of other difficulties follow from using “human-in-the-loop”: it makes it very difficult to chain the output of one piece of software directly into another and to automate these longer chains of processes. And this human will be a bottleneck for speed.

[14] This slide provides a rough overview of the AI ecosystem as it is being built now. You can see that it has multiple players with different, often competing interests, that there are small players and large players, such as the commercial providers of the largest AI models, and that many different players want to sell to customers, including lawyers.

[15] I’m sure that lawyers who are able to use an LLM are already at an advantage over those lawyers who are not able to use such tools. However, the competition does not end here.

How could lawyers build on such tools in the longer term, what kind of persistent advantages could lawyers acquire?

Small firms face a challenge here. Their main and only edge is the first line on this slide: they will be able to rely on LLMs to broaden their knowledge, boost their processes and so on. That’s the only thing that can set them apart from each other.

However, the AI ecosystem provides more opportunities for firms that have access to more resources. These firms may capitalise in the long term on having access to different or custom applications, models or data. Such competitive advantages can become long-term differentiators.

Please note that the most valuable data held is client data, which is subject to confidentiality and also to data protection rules. So law firms that would like to rely on such data to gain any advantage will first need to obtain some kind of approval from their clients for such training.
