AI code testing: Codium vs JetBrains (plus a custom OpenAI assistant)

15.04.2024 | 10 min read

Coding assistant tools are getting praise in the software development industry for many reasons. They are successfully used to automate the process of developing new features, finding bugs, and testing code. Compared to a human programmer, they're now at the level of a solid junior developer, but the future looks extremely promising.

Suffice it to say, in many cases, productivity and development speed can today be significantly increased by using these AI tools, and that’s just the beginning.

This research mainly aims to compare two AI tools for generating tests in Python (FastAPI), Node (NestJS), and React environments: the CodiumAI and JetBrains AI assistants. To extend this comparison, we are also going to use our own custom OpenAI test assistant, which is based on the GPT-4 model.

In short, our goals are to:

  • Compare the effectiveness of the CodiumAI and JetBrains AI tools in generating tests for Python (FastAPI), Node (NestJS), and React environments.
  • Introduce and assess a custom test assistant based on OpenAI's GPT-4 model, alongside the above.
  • Evaluate the capabilities of these tools in several simple scenarios to identify bugs and ensure code reliability.
  • Explore the benefits of AI tools in providing a quick start to testing, understanding basic test structures, and identifying edge cases.

Since code testing is quite a complex topic, we’ll need to come up with assumptions and definitions to have a clearer overview of what we want to test and why.

First, however, let's take a look at all the AI assistants in question and their capabilities.

Codium AI Tool Overview

Codium AI helps developers write tests for their code. It can generate a comprehensive test suite for various types of code, including classes, functions, and code snippets.

It assists developers by automating the test creation process, saving time and effort. Sounds too good to be true? Maybe, but we are going to find out soon.

The other features it offers (currently in both the free and paid versions):

  • Code explanation
  • Code suggestion
  • Pull request assistant (with review, describe, improve, and commit options)
  • Generating happy/edge/other paths

More details can be found on the Codium AI website.

Out of the box, there are plugins for both VS Code and all the IntelliJ-based IDEs. The installation process is pretty straightforward, so we won't cover it here. After installation, registration is required. For test generation, we can use the free version (I haven't run into any limitations so far).

JetBrains AI Tool Overview

JetBrains AI is the new kid on the block. It came out of beta a few months ago and works as an integrated plugin for all JetBrains IDEs. The AI Assistant provides AI-powered features for software development: it can explain code, answer questions about code fragments, provide code suggestions, generate unit tests, documentation, and commit messages, and much more.

Features that are currently available:

  • Code explanation
  • Refactoring suggestion
  • Adding type annotations (Python only)
  • Finding code problems
  • Writing documentation
  • Generating unit tests
  • Generating code
  • Converting to another language
  • Chat (with access to VCS history)
  • Tracebacks explanation

More details can be found on the JetBrains AI website.

For this comparison, the most important feature is the ability to generate unit tests. So, let's see how it works.

OpenAI Custom Test Assistant Overview

To make the comparison complete, we are also going to use our custom test assistant, which is based on the GPT-4 model. The custom AI agent we used for the testing is shown below.

You can use it for your own experiments with code testing, too.

Custom OpenAI GPT-4 AI assistant by 10Clouds.

Role

You are the "Unit Test Agent," a specialized assistant for software developers and QA engineers. Your primary role is to facilitate the creation, execution, and management of unit tests. You help in identifying the smallest testable parts of an application, writing tests for them, and running these tests to ensure the code behaves as expected under various conditions.

Your goal is to ensure the reliability and quality of the software by catching bugs and issues at the earliest stage of development.


Your Functions

| Function / Task | Description | Tools or Methodology Suggestion | Definition of Success |
| --- | --- | --- | --- |
| Test Case Generation | Automatically generate test cases based on the input code. | Use static analysis and code inspection techniques. | A set of test cases that cover a significant portion of the code paths. |
| Test Execution | Run unit tests and report results. | Integration with testing frameworks like JUnit, NUnit, etc. | Successful execution of all unit tests with a detailed report of pass/fail status. |
| Mocking and Simulation | Create mock objects and simulate external dependencies. | Utilize mocking frameworks like Mockito, Moq. | Ability to test units in isolation without relying on external systems. |
| Test Coverage Analysis | Analyze and report on the code coverage of unit tests. | Code coverage tools like JaCoCo, Istanbul. | A report detailing the percentage of code executed by the unit tests. |
| Continuous Testing Integration | Integrate unit testing into the continuous integration/continuous deployment (CI/CD) pipeline. | Use CI/CD tools like Jenkins, Travis CI, CircleCI. | Automated execution of unit tests on every code check-in or build. |
| Test Optimization | Suggest optimizations to make tests more efficient and effective. | Analyze test execution patterns and results. | Recommendations for improving test speed, reliability, and coverage. |

Interaction with User

Collect Requirements: Ask the user about the specific language, framework, and components they are working with.

Test Generation: Offer to generate test cases based on the user's code snippets or descriptions.

Execution and Reporting: Execute the generated or provided tests and report the results back to the user.

Feedback Loop: Ask for feedback on the test outcomes and suggest modifications or additional tests as needed.

Outputs

Generated Test Cases: A set of unit tests tailored to the user's code.

Test Execution Report: A detailed report of the test execution, including pass/fail status, code coverage, and recommendations for improvement.

Optimization Suggestions: Recommendations for optimizing test cases and improving test coverage and efficiency.
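
Since the agent above is defined purely as a prompt, readers who want to run similar experiments themselves may find it useful to see how it can be wired up programmatically. Below is a minimal sketch using the OpenAI Python SDK (v1+) with the role text above as a system prompt; the generate_tests helper and its parameters are illustrative assumptions, not the exact 10Clouds setup.

python
# A minimal sketch (assumes the OpenAI Python SDK v1+ and a GPT-4 chat model);
# illustrative only, not the exact 10Clouds implementation.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (
    'You are the "Unit Test Agent," a specialized assistant for software '
    "developers and QA engineers. "
    # ...paste the rest of the role and function definitions from above here
)


def generate_tests(code_snippet: str, framework: str = "pytest") -> str:
    """Ask the assistant to generate unit tests for the given code snippet."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {
                "role": "user",
                "content": f"Generate {framework} unit tests for the following code:\n\n{code_snippet}",
            },
        ],
    )
    return response.choices[0].message.content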

Testing Scenarios and Results

As I mentioned before, to compare the tools from multiple perspectives and gain a comprehensive understanding of their differences, we need a few selected code testing scenarios.

I have decided to test each of the tools on a couple of simple cases; my idea was to check whether they can fulfill basic assumptions regarding test setup or proper database configuration.

In this way, I could clearly see whether the results were correct or not. So without further ado, let’s start with the first scenario.

FastAPI application

Here is a simple view for creating users (it is based on the FastAPI tutorial, where you can check the rest of the code).

python
@app.post("/users/")
def create_user(user: UserCreate, db: Session = Depends(get_db)) -> User:
    db_user = crud.get_user_by_email(db, email=user.email)
    if db_user:
        raise HTTPException(status_code=400, detail="Email already registered")
    return crud.create_user(db=db, user=user)
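
The generated tests below refer to helpers from the tutorial's crud and schemas modules. For context, here is a simplified sketch of those helpers, abbreviated from the FastAPI SQL databases tutorial this example is based on; the exact implementations are inferred from the imports and calls above, so treat it as illustrative.

python
# Simplified sketch of the tutorial's schemas.py / crud.py helpers (abbreviated).
from pydantic import BaseModel
from sqlalchemy.orm import Session

import models  # the tutorial's SQLAlchemy models module (defines models.User)


class UserCreate(BaseModel):
    """schemas.UserCreate: the payload accepted by POST /users/."""
    email: str
    password: str


def get_user_by_email(db: Session, email: str):
    """crud.get_user_by_email: look up a user by its unique email."""
    return db.query(models.User).filter(models.User.email == email).first()


def create_user(db: Session, user: UserCreate):
    """crud.create_user: persist a new user (fake password hashing, as in the tutorial)."""
    db_user = models.User(email=user.email, hashed_password=user.password + "notreallyhashed")
    db.add(db_user)
    db.commit()
    db.refresh(db_user)
    return db_user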

The Codium result (after fixing the prompt to state that the tests are part of the FastAPI module):

python
def test_create_user_valid_email_and_password(mocker):
    # Arrange
    user_data = {
        "email": "test@example.com",
        "password": "password123"
    }
    db_mock = mocker.Mock()
    crud.get_user_by_email = mocker.Mock(return_value=None)
    crud.create_user = mocker.Mock(return_value=User(**user_data))

    # Act
    response = create_user(user=UserCreate(**user_data), db=db_mock)

    # Assert
    assert response == User(**user_data)
    crud.get_user_by_email.assert_called_once_with(db_mock, email=user_data["email"])
    crud.create_user.assert_called_once_with(db=db_mock, user=UserCreate(**user_data))

And here is the JetBrains result (again, after adding the context that this should be a FastAPI pytest test):

python
from fastapi.testclient import TestClient
import pytest
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

from database import Base
from main import app, get_db

# Create Test Client
client = TestClient(app)

# corrected
SQLALCHEMY_DATABASE_URL = "sqlite:///./test.db"

# Create a test engine
engine = create_engine(
    SQLALCHEMY_DATABASE_URL,
    connect_args={"check_same_thread": False}
)

# Create test database session local
SessionLocal = sessionmaker(autocommit=False, autoflush=False, bind=engine)


# Dependency override for get_db
def override_get_db():
    db = SessionLocal()
    try:
        yield db
    finally:
        db.close()


app.dependency_overrides[get_db] = override_get_db


# corrected ---->
@pytest.fixture(autouse=True)
def clean_database():
    Base.metadata.create_all(bind=engine)
    yield
    Base.metadata.drop_all(bind=engine)


# <---- corrected

def test_create_user():
    # Test creation of new user
    response = client.post(
        "/users/",
        json={"email": "test@example.com", "password": "testpassword"}
    )
    assert response.status_code == 200
    data = response.json()
    # Assuming that the user object contains id, email, and hashed_password
    assert data["email"] == "test@example.com"

    # Test trying to create a user with an existing email
    response_existing = client.post(
        "/users/",
        json={"email": "test@example.com", "password": "testpassword2"}
    )
    assert response_existing.status_code == 400
    assert response_existing.json() == {"detail": "Email already registered"}


And the last one comes from our custom OpenAI test assistant:

python
from fastapi.testclient import TestClient
from main import app
import models, crud
from sqlalchemy.orm import Session
from schemas import UserCreate

client = TestClient(app)


def test_create_user():
    session = Session()

    test_user = UserCreate(email="test@test.com", password="password")
    assert crud.get_user_by_email(session, email=test_user.email) is None

    response = client.post("/users/", json=test_user.dict())

    assert response.status_code == 200
    assert crud.get_user_by_email(session, email=test_user.email) is not None

    session.close()

As we can see, the result from Codium is much simpler, and it appears that the tool does not include the testing context, such as setting up the database, using the appropriate API test client, or cleaning the database after each test.


The result from JetBrains is considerably more complex. However, it covers all aspects of the testing process (though it's not flawless, as marked in the code above: it didn't run correctly as generated, and I had to fix a few details to make it work properly).

The test generated by our custom OpenAI test assistant did not work, and small improvements are not enough to make it work properly. The entire approach of instantiating a bare SQLAlchemy Session object is wrong, and the test is not even close to being functional.
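
For reference, here is a minimal sketch of a test assertion that would actually hit the database, reusing the client, SessionLocal, and clean_database fixture from the JetBrains example above. It is illustrative only (my own sketch, not tool output).

python
# Minimal sketch; assumes the client, SessionLocal, and clean_database fixture
# from the JetBrains example above, plus an import of the crud module.
def test_create_user_persists_row():
    response = client.post(
        "/users/",
        json={"email": "test@test.com", "password": "password"},
    )
    assert response.status_code == 200

    # Check the row in the same test database the app wrote to,
    # instead of instantiating an unbound SQLAlchemy Session().
    db = SessionLocal()
    try:
        assert crud.get_user_by_email(db, email="test@test.com") is not None
    finally:
        db.close()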

One point for JetBrains!

NestJS application

In the second example, let's take a simple controller with a single endpoint in NestJS. The point of the test is to check whether the AI tool can generate a proper test setup for the controller, mock the injected dependencies, and test the endpoint.

typescript

@Controller()
@UseGuards(PermissionsGuard)
export class AuthController {
    constructor(private authService: AuthService) {
    }

    @Put('/users/:id')
    updateUser(
        @Param('id') id: string,
        @Body() body: { name: string; active: boolean },
    ) {
        this.authService.logEvent(`Updating user ${id} -> ${body}`);
        return {
            endpoint: 'users',
            message: `User ${id} updated`,
        };
    }
}

The Codium result:

typescript

it('should update user with valid id and body', () => {
    // Arrange
    const id = 'validId';
    const body = {name: 'John Doe', active: true};
    const authServiceMock = {
        logEvent: jest.fn(),
    };
    const authController = new AuthController(authServiceMock);

    // Act
    const result = authController.updateUser(id, body);

    // Assert
    expect(authServiceMock.logEvent).toHaveBeenCalledWith(`Updating user ${id} -> ${body}`);
    expect(result).toEqual({
        endpoint: 'users',
        message: `User ${id} updated`,
    });
});

The JetBrains result:

typescript

import {Test, TestingModule} from '@nestjs/testing';
import {AuthController} from './auth.controller';
import {AuthService} from './auth.service';
import {User} from './user.entity';

describe('AuthController', () => {
    let controller: AuthController;
    let service: AuthService;

    beforeEach(async () => {
        const module: TestingModule = await Test.createTestingModule({
            controllers: [AuthController],
            providers: [AuthService, {
                provide: 'USER_REPOSITORY',
                useClass: User // You'd typically use a mock instead
            }],
        }).compile();

        controller = module.get<AuthController>(AuthController);
        service = module.get<AuthService>(AuthService);
    });

    describe('updateUser', () => {
        it('should log the event and return the correct response', async () => {
            const user = {id: '1', name: 'test user', active: true};
            const spy = jest.spyOn(service, 'logEvent').mockImplementation(() => {
            });
            expect(await controller.updateUser(user.id, user)).toEqual({
                endpoint: 'users',
                message: `User ${user.id} updated`
            });
            expect(spy).toHaveBeenCalledWith(`Updating user ${user.id} -> ${JSON.stringify(user)}`);
        });
    });
});


The OpenAI custom test assistant result:

typescript
import {Test} from '@nestjs/testing';
import {AuthController} from './auth.controller';
import {AuthService} from './auth.service';

//added for correct AuthService mocking
const mockAuthService = () => ({
    logEvent: jest.fn(),
});

describe('Auth Controller', () => {
    let authController: AuthController;
    let authService: AuthService;

    beforeEach(async () => {
        const moduleRef = await Test.createTestingModule({
            controllers: [AuthController],
            // corrected
            providers: [{provide: AuthService, useFactory: mockAuthService}],
        }).compile();

        authController = moduleRef.get<AuthController>(AuthController);
        authService = moduleRef.get<AuthService>(AuthService);
    });

    it('Should log event correctly', async () => {
        jest.spyOn(authService, 'logEvent');
        // updated
        authController.updateUser('1', {name: 'John', active: true});
        expect(authService.logEvent).toHaveBeenCalledWith(
            'Updating user 1 -> {"name":"John","active":true}',
        );
    });

    it('Should return the correct update response', async () => {
        const response = authController.updateUser('1', {
            name: 'John',
            active: true,
        });
        expect(response).toEqual({endpoint: 'users', message: `User 1 updated`});
    });
});

In this case, the results are similar to the previous example. The Codium result is much simpler, and it appears that it does not include the testing context. It also does not set up the tests correctly (using TestingModule and mocking the dependencies).

The result from JetBrains is, once again, considerably more complex, and it could be used as a base for further testing (e.g. more complex scenarios or other controllers' endpoints).

The same goes for our custom OpenAI test assistant: the result is quite good in this particular case and works well with only a couple of small adjustments.

Another point for JetBrains and one for our OpenAI custom test assistant!

ReactJS application

As I don't have much ReactJS experience, in this test I took an approach similar to what we could expect from a lower-seniority developer.

I decided to test the tools with two scenarios. The first one was a simple component with straightforward business logic, and the second one was based on a ready-to-use React project that was available on GitHub but didn't have any tests.

Note that this component is far from a real-world scenario, but it's a good starting point to see how the tools work.

In a nutshell, what we should do here is to mock the request, add the return data, check the component state, and finally check the rendered component. Additionally, the Axios request is asynchronous, so we should handle it properly to avoid checking the rendered state too early.

tsx
function Posts() {
    const [posts, setPosts] = useState<Post[]>([]);

    useEffect(() => {
        axios.get<Post[]>("https://jsonplaceholder.typicode.com/posts")
            .then(res => {
                setPosts(res.data);
            })
            .catch(err => {
                console.error(err);
            });
    }, []);

    return (
        <div>
            <h2>Posts</h2>
            <h3>Number of posts: {posts.length}</h3>
            {posts.map(post =>
                <div key={post.id}>
                    <h2>{post.title}</h2>
                    <p>{post.body}</p>
                </div>
            )}
        </div>
    );
}

The Codium result:

tsx
// Generated by CodiumAI

// !! corrected render import
import {waitFor, render} from "@testing-library/react";
import axios from "axios";
// !! corrected
import Posts, {Post} from "./ApiCall";

// !! added manually
jest.mock('axios');
const mockedAxios = axios as jest.Mocked<typeof axios>;


describe('Posts', () => {

    // Renders a list of posts fetched from an external API
    it('should render a list of posts when API call is successful', async () => {
        // Arrange
        const dummyData: Post[] = [
            {userId: 1, id: 1, title: 'Post 1', body: 'Body 1'},
            {userId: 2, id: 2, title: 'Post 2', body: 'Body 2'},
        ];
        mockedAxios.get.mockResolvedValueOnce({data: dummyData});

        // Act
        const {getByText} = render(<Posts/>);

        // Assert
        await waitFor(() => {
            expect(getByText('Post 1')).toBeInTheDocument();
            expect(getByText('Body 1')).toBeInTheDocument();
            expect(getByText('Post 2')).toBeInTheDocument();
            expect(getByText('Body 2')).toBeInTheDocument();
        })
    });

    // Displays the title and body of each post
    it('should display the title and body of each post', async () => {
        // Arrange
        const mockedAxios = axios as jest.Mocked<typeof axios>;
        const dummyData: Post[] = [
            {userId: 1, id: 1, title: 'Post 1', body: 'Body 1'},
            {userId: 2, id: 2, title: 'Post 2', body: 'Body 2'},
        ];
        mockedAxios.get.mockResolvedValueOnce({data: dummyData});

        // Act
        const {getByText} = render(<Posts/>);

        // Assert
        await waitFor(() => {
            expect(getByText('Post 1')).toBeInTheDocument();
            expect(getByText('Body 1')).toBeInTheDocument();
            expect(getByText('Post 2')).toBeInTheDocument();
            expect(getByText('Body 2')).toBeInTheDocument();
        });
    });
});

The JetBrains result:

tsx
import React from 'react';
import {render, waitFor, screen} from '@testing-library/react';
import Posts from './ApiCall';
import axios from 'axios';
import {act} from 'react-dom/test-utils';
import '@testing-library/jest-dom/extend-expect';

jest.mock('axios');

describe('<Posts />', () => {
    test('fetches posts and renders them on mount', async () => {
        const posts = [
            {id: 1, title: 'Post 1', body: 'This is post 1'},
            {id: 2, title: 'Post 2', body: 'This is post 2'},
        ];
        (axios.get as any).mockResolvedValueOnce({data: posts});

        await act(async () => render(<Posts/>));

        await waitFor(() => screen.getByText('Number of posts: 2'));

        expect(screen.getByText('Number of posts: 2')).toBeInTheDocument();
        expect(screen.getByText('Post 1')).toBeInTheDocument();
        expect(screen.getByText('This is post 1')).toBeInTheDocument();
        expect(screen.getByText('Post 2')).toBeInTheDocument();
        expect(screen.getByText('This is post 2')).toBeInTheDocument();
        expect(axios.get).toHaveBeenCalledWith("https://jsonplaceholder.typicode.com/posts");
    });

    test('handles error for axios request', async () => {
        const errorMessage = 'Network Error';
        (axios.get as any).mockImplementationOnce(() =>
            Promise.reject(new Error(errorMessage)),
        );

        const {getByText} = render(<Posts/>);

        await waitFor(() => getByText(/Error/));

        expect(getByText(/Error/)).toBeInTheDocument();
    });
});

OpenAI custom test assistant result:

Codium, by default, tries to cover more scenarios, including the happy path, edge cases, and custom ones if needed. Hence, it generates more tests (in this case only 2, but usually around 6 to 10, and we can ask for more). However, the quality of these tests is far from satisfactory. The tool lacks an understanding of the context and business logic, so it is unable to generate properly functioning tests. The tests are rather simple, and the tool would likely work better with simple pure functions than with a component containing internal logic. The result is not even close to being functional (due to wrong imports, missing mocks, etc.).

In this particular case, I needed to fix the imports, add the proper mocks, and adjust the tests to make them work, without touching the testing scenarios themselves, which I wanted to show unchanged.

As we can see with JetBrains, the results are pretty good again. The tool generated the tests with the proper context and mocked the request. Some minor changes were required to make them work, but if you look closely at the second case (handles error for axios request), you will see that the tool generated a test for output that is never rendered (the component only logs a console error when the request fails). This is a clear sign that the tool is not perfect and requires a human eye to review the generated scenarios and make the necessary adjustments.


The last result with OpenAI is quite good, and it works with small adjustments. The tool has generated the tests with the proper context and mocked the request. The tests are not perfect, but they are close to being functional. I would say that they can be used as a base for further testing and improvements. The output still requires a human (developer) to review the generated scenarios and make the necessary updates.


Here goes one point for JetBrains and one for OpenAI.

Total Results and Limitations

Overall, JetBrains scored 3 points, OpenAI 2 points, and Codium, unfortunately, zero.

Although testing these tools has been a great experience, it is important to note that there is an extensive list of limitations that need to be addressed. These include:

  • Understanding the testing context
  • Providing the right data for the tests (especially for the backend tests)
  • Handling business logic and more complex scenarios
  • Inability to replace human intuition when writing test scenarios
  • Updating existing tests when the code or business logic changes (the tools cannot do this; they always generate new ones)
  • Missing imports, mocks, or other parts, requiring manual adjustments
  • Hallucinations (the tools don't fully understand the context and sometimes generate tests for non-existent components or functions)

Conclusion and the Benefits of Using AI-assisted Code Testing

The testing process is a critical component of the software development lifecycle. There is a saying that untested software is broken by default, which may be true in some cases. Testing tools might assist us somewhat in the process of writing tests, but they are not (at least not yet) a perfect solution.

In simpler cases, we might use them to set up the initial tests, but a deep understanding of the inner workings of the code is usually required to write proper tests.

In more complex cases, these tools might serve as a starting point, but the tests would still generally be written manually — especially if we aim to have maintainable and readable code.

Nevertheless, I can see some benefits to using these tools, especially for beginners, such as:

  • Understanding the basic structure of the tests
  • A quick start on writing tests (simple case setup, initial mocking, etc.)
  • Understanding basic mocking and spying
  • Understanding the edge cases that should be covered by the test scenarios

So, all things considered, we can treat those tools as mildly helpful. They can be used as a guide for testing, but the final tests should still be written manually.

Anything else?

I still recommend checking out other testing examples, acquiring an example application from the web, and attempting to write tests for it (unit/integration/end-to-end). This is a good learning approach, particularly when someone at a higher seniority level can review the scenarios and implementations.

Perhaps in the future, artificial intelligence tools will be able to generate comprehensive tests for us, but for now, we still need to write them the old-fashioned way.

