Navigation
Search
|
Chat with data the easy way in R or Python
Thursday September 4, 2025. 11:00 AM , from InfoWorld
Even if you love to code, there probably are times when you’d rather ask a question like, “What topics generated the highest reader interest this year?” than write an SQL query with phrases like STRFTIME(‘%&Y’, Date) = STRFTIME(‘%Y’, ‘now’). And, if your data set has dozens of columns, it’s nice to be able to avoid looking up or remembering the exact name of each one.
Generative AI and large language models (LLMs) ought to be able to help you “talk” to your data, since they excel at understanding language. But as any serious genAI user knows by now, you can’t necessarily trust a chatbot’s results—especially when it’s doing calculations. One way around that is to ask a model to write the code to query a data set. A human can then check to make sure the code is doing what it’s supposed to do. There are various ways to get an LLM to write code for data analysis. One of the easiest for R and Python users is with the querychat package available for both languages. Also see my quick look at Databot, an AI assistant for querying and analyzing data in R and Python. Natural language querying with querychat Querychat is a chatbot data-analysis component for the Shiny web framework. It translates a plain-language request into SQL, runs that SQL on your data, then displays both its result and the SQL code that generated it. Showing raw SQL code might not be ideal for the average user, but it’s great for programming-literate users who want to verify a model’s response. A key advantage of the querychat workflow is that you never have to send your data to an LLM, or to the cloud. Instead, you define your data structure, the LLM writes SQL code based on your definition of the data’s columns, and then that code is run on your machine and displayed for you to review. That keeps your data private. Another benefit is that this can work equally well whether your data set has 20 rows or 2 million—the limit is what your local system can handle, not the LLM’s context window. The querychat web app can answer questions about the data or filter the data display, all based on plain-language requests. I’ll demonstrate how to use querychat with public data; namely, National Football League game results from last season (and this year when available). You can see examples for both R and Python. Let’s get started! R with querychat The first thing we’ll do is get and process the NFL data. For this example, I’ll use the nflverse suite of R packages to import the NFL game data into R. You can install it from CRAN with install.packages('nflverse'). In the code below, I load the nflreadr package from the nflverse as well as dplyr for basic data wrangling and feather for a file format used by both R and Python. You should also install these packages if you want to follow along and they’re not already on your system. Also, install querychat for R from GitHub with pak::pak('posit-dev/querychat/pkg-r'). I use nflreadr’s load_schedules() function to import data for the 2024 and 2025 NFL seasons, removing all games where the result is NA (representing games that are on the schedule but haven’t been played yet): library(nflreadr) library(dplyr) library(feather) game_data_all filter(!is.na(result)) Note that the nflverse game data is also available in CSV format for seasons since 1999. So, you could import it with your favorite CSV import function instead of using nflreadr. For example: game_data_all filter(season %in% c(2024, 2025) &!is.na(result)) The load_schedules() function returns a data frame with 46 variables for metrics including game time, temperature, wind, playing surface, outdoor or dome, point spreads, and more. Run print(dictionary_schedules) to see a data frame with a data dictionary of all the fields. Now that I have the data, I need to process it. I’m going to remove some ID fields I know I don’t want and keep everything else: cols_to_remove away_score ~ home_team, away_score > home_score ~ away_team,.default = NA ), team_lost = case_when( home_score > away_score ~ away_team, away_score > home_score ~ home_team,.default = NA ) ) write_feather(games, 'games.feather') Querying your data with querychat in R You’ll likely get more accurate querychat results if you add a couple of optional arguments when setting up your chat, such as a description of your data columns. It’s better to explain what each of your data fields represents than have the model try to guess—the same as when you’re working with human colleagues! The data description can be a Markdown or a text file, and you can use your judgment on how to structure the text. The advice from the documentation is, “Just put whatever information, in whatever format, you think a human would find helpful.” Since nflreadr comes with a data dictionary for this data set, I started with that built-in dictionary_schedules data frame. I then deleted definitions of columns I removed from my data, converted the data frame into plain text, and saved that to a data_dictionary.txt file: data_dictionary filter(!(field %in% cols_to_remove) ) text_for_file
https://www.infoworld.com/article/4040535/chat-with-your-data-the-easy-way-in-r-or-python.html
Related News |
25 sources
Current Date
Sep, Thu 4 - 21:08 CEST
|