Paper, Law and Truth

The legal industry in South Africa is still dominated by paper records as a data source, limiting or problematising attempts to analyse information en masse. What strategies or techniques can be used to help in the field? OpenUp is busy exploring different types of legal record digitalisation.

Law and paper are intimate bedfellows. When the Zondo Commission reported that it had spent R1,5 million on paper alone, I doubt a single person in (or ancillary to) the legal industry was surprised. You only need to think of the South African lawyer’s cult of the pilot case: do you see a person in this country proudly wheeling an antiquated, giant black bag, waterlogged with papers, behind them? You have just spotted an advocate.

Composite image made by googling “Advocate South Africa”.

The ceremonial sacrifice of trees to the legal cult need not be an inevitability - although it is an insidious result of both the nature of application proceedings generally, and the slow digitalisation of legal service and court filings specifically. In a throwback to the 90s, in reporting to Parliament this year the Department of Justice and the Office of the Chief Justice noted that “…[a] lack of [physical] space makes storing old court files very challenging”. While the rest of the world's Google Drive notifications alert them to strained virtual space, bureaucratic and paper-based practices across government and judicial structures in South Africa continue to cause physical strains. Although the digitisation of court filings through the Office of the Chief Justice’s (OCJ) pilot programme “Court Online” is a step in the right direction, progress remains slow: in the same reporting session in Parliament, the OCJ reported it was unable to attain its (not exceptionally lofty) goal of rolling out the system to two service centres, both because of the pandemic and because of a security breach in September 2020. This is naturally an inauspicious start to any digital project involving swathes of sensitive and personal data.

We can look back to the Zondo Commission to ground the problem even further. The Zondo Commission, to put it somewhat tritely, is the judicial inquiry into state capture, corruption and fraud in South Africa’s public sector. Fundamentally, judicial inquiries are meant to be about truth and publicity. But how do people extract the truth from 1 petabyte of data and 8,655,530 pages of documentary evidence? Is publication enough to equate to publicity?

OpenUp believes data is a fundamental tool for exposing truth in opaque environments. One solution with swathes of documents is, of course, to digitise them - and to just digitise them ourselves. The problem with scanned PDF documents, of course, is that they are not ordinarily machine-readable: they need to pass through “optical character recognition” (OCR) to be translated from images into encoded text. Google Drive actually has a fair version of this freely available if you upload a PDF to your drive and then “open with” Google Docs (although most times it…won’t be pretty). But there are also more refined open-source tools for doing this.

In our Open Courts project (you can read about some of our previous exploits in trying to open judicial material here) we have been uploading the Zondo Commission documentation and transcripts on to an Aleph instance. Aleph was originally designed for investigative journalists, but basically what it allows you to do is take large volumes of PDFs and other image files (i.e. it can take large volumes of both structured and unstructured data) and make them easily searchable, with “entity extraction” that shows relationships between datasets. You can learn a bit more about its processes through its documentation here.
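To give a rough sense of what “easily searchable” means once OCR has produced text, here is a minimal sketch of a full-text search index in Python. It is not how Aleph works internally; it simply assumes the OCR text for each page is already available, and the file names and page texts are made up for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_search_index(pages):
    """Map every word to the set of (document, page) locations containing it.

    `pages` is a dict of (document, page_number) -> OCR text (hypothetical input).
    """
    index = defaultdict(set)
    for location, text in pages.items():
        for token in tokenize(text):
            index[token].add(location)
    return index


def search(index, query):
    """Return the sorted locations that contain every word in the query."""
    tokens = tokenize(query)
    if not tokens:
        return []
    hits = set.intersection(*(index.get(token, set()) for token in tokens))
    return sorted(hits)


# Made-up OCR output, keyed by (file name, page number).
pages = {
    ("transcript_day_001.pdf", 1): "The witness was sworn in before the Chairperson.",
    ("transcript_day_001.pdf", 2): "The Chairperson asked about the procurement contract.",
    ("exhibit_A.pdf", 1): "Procurement contract signed by the department.",
}

index = build_search_index(pages)
results = search(index, "procurement contract")
print(results)
```

A real deployment would also have to cope with OCR errors, stemming and ranking, which is exactly the kind of work a dedicated tool takes off your hands.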

OpenUp is still experimenting to see what we can learn from this. The one utility we’ve already realised is something that would have helped me as a young Constitutional Clerk, manually indexing case bundles until midnight during the Thint/Zuma versus National Director of Public Prosecutions trial - we can potentially generate indexes for the entire set of documentary files and transcripts, across all manner of different entities, creating what is essentially a simple map for orienting yourself in the labyrinth of state capture documents. These indexes will be shared in the next project update.
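As an illustration of what such an index could look like, the sketch below builds one from page-level text and a list of entity names. It is a simplified stand-in, not Aleph's entity extraction: it assumes the entity names are already known and matches them literally. “Company X”, “Department Y” and the file names are placeholders.

```python
import re
from collections import defaultdict


def build_entity_index(pages, entities):
    """Record, per entity and per document, the pages that mention the entity.

    `pages` is a dict of (document, page_number) -> text; `entities` is a list
    of names to look for. Both are hypothetical inputs for this sketch.
    """
    index = defaultdict(lambda: defaultdict(list))
    # Sorting the locations keeps page numbers in ascending order per document.
    for (document, page_number), text in sorted(pages.items()):
        for entity in entities:
            pattern = r"\b" + re.escape(entity) + r"\b"
            if re.search(pattern, text, flags=re.IGNORECASE):
                index[entity][document].append(page_number)
    return index


def format_index(index):
    """Render the index as plain text, like the index at the front of a bundle."""
    lines = []
    for entity in sorted(index):
        lines.append(entity)
        for document in sorted(index[entity]):
            page_list = ", ".join(str(p) for p in index[entity][document])
            lines.append(f"  {document}: pp. {page_list}")
    return "\n".join(lines)


# Placeholder page text, keyed by (file name, page number).
pages = {
    ("bundle_01.pdf", 1): "Company X submitted an invoice to Department Y.",
    ("bundle_01.pdf", 3): "Payment to Company X was approved.",
    ("transcript_day_002.pdf", 7): "The witness described meetings at Department Y.",
}
entities = ["Company X", "Department Y"]

entity_index = build_entity_index(pages, entities)
print(format_index(entity_index))
```

Literal matching will miss misspellings and alternative names for the same entity, so in practice the list of entities and their variants would need to come from a proper extraction step.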

The point for now is simply this - publicity and simplicity are kissing cousins. Access to justice questions should not just be about access through the doors of the courts, but access - with clear signposts - through the documents (and data) that are meant to yield the truth. OpenUp is happy to report it is one step closer to achieving that clarity.
