IBM's CodeNet Dataset Can Teach AI To Translate Computer Languages

Tuesday, May 11, 2021, 12:02 PM, from Slashdot

IBM announced during its Think 2021 conference on Monday that its researchers have crafted a Rosetta Stone for programming code. Engadget reports: In effect, we've taught computers how to speak human, so why not also teach computers to speak more computer? That's what IBM's Project CodeNet seeks to accomplish. 'We need our ImageNet, which can snowball the innovation and can unleash this innovation in algorithms,' [Ruchir Puri, IBM Fellow and Chief Scientist at IBM Research, said during his Think 2021 presentation]. CodeNet is essentially the ImageNet of computers. It's an expansive dataset designed to teach AI/ML systems how to translate code and consists of some 14 million snippets and 500 million lines spread across more than 55 legacy and active languages -- from COBOL and FORTRAN to Java, C++, and Python.

'Since the data set itself contains 50 different languages, it can actually enable algorithms for many pairwise combinations,' Puri explained. 'Having said that, there has been work done in human language areas, like neural machine translation which, rather than doing pairwise, actually becomes more language-independent and can derive an intermediate abstraction through which it translates into many different languages.' In short, the dataset is constructed in a manner that enables bidirectional translation. That is, you can take some legacy COBOL code -- which, terrifyingly, still constitutes a significant amount of this country's banking and federal government infrastructure -- and translate it into Java as easily as you could take a snippet of Java and regress it back into COBOL.

CodeNet can be used for functions like code search and clone detection, in addition to its intended translational duties and serving as a benchmark dataset. Also, each sample is labeled with its CPU run time and memory footprint, allowing researchers to run regression studies and potentially develop automated code correction systems. Project CodeNet consists of more than 14 million code samples along with 4,000-plus coding problems collected and curated from decades of programming challenges and competitions across the globe. 'The way the data set actually came about,' Puri said, 'there are many kinds of programming competitions and all kinds of problems -- some of them more businesslike, some of them more academic. These are the languages that have been used over the last decade and a half in many of these competitions with 1000s of students or competitors submitting solutions.' Additionally, users can run individual code samples 'to extract metadata and verify outputs from generative AI models for correctness,' according to an IBM press release. 'This will enable researchers to program intent equivalence when translating one programming language into another.' IBM intends to release the CodeNet data to the public domain, allowing researchers worldwide equal and free access.
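
To make the output-verification idea concrete, here is a minimal sketch of how a candidate translation might be checked against a reference solution by running both on the same inputs. It is not IBM's tooling: the names run_sample and same_output, the two sample programs, and the test inputs are all hypothetical, and both programs are written in Python only so the example runs without extra compilers.

```python
import subprocess
import sys


def run_sample(source: str, stdin_text: str, timeout_s: float = 5.0) -> str:
    """Run a Python source string in a fresh interpreter and return its stdout."""
    result = subprocess.run(
        [sys.executable, "-c", source],
        input=stdin_text,
        capture_output=True,
        text=True,
        timeout=timeout_s,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return result.stdout


def same_output(reference: str, candidate: str, test_inputs: list[str]) -> bool:
    """Return True if both programs print the same tokens for every input.

    A crash or timeout in either program counts as a mismatch.
    """
    for stdin_text in test_inputs:
        try:
            expected = run_sample(reference, stdin_text).split()
            actual = run_sample(candidate, stdin_text).split()
        except (subprocess.CalledProcessError, subprocess.TimeoutExpired):
            return False
        if expected != actual:
            return False
    return True


# Hypothetical stand-ins for a reference solution and a model-generated one.
REFERENCE = "print(sum(int(x) for x in input().split()))"
CANDIDATE = "a, b = map(int, input().split())\nprint(a + b)"

if __name__ == "__main__":
    print(same_output(REFERENCE, CANDIDATE, ["1 2\n", "10 -3\n"]))
```

Matching outputs on a handful of inputs is only evidence of equivalence, not proof, so a real check would depend on how thoroughly the test inputs cover each problem.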
