
Virtual Machinations: Using Large Language Models as Neural Computers

LLMs can function not only as databases, but also as dynamic, end-user programmable neural computers.

Erik Meijer

Many people view LLMs (large language models) as databases of compressed world knowledge. Just like databases, they deliver answers to given user queries. Rather than rows and columns—as in a relational database—an LLM's knowledge is encoded into billions (or even trillions) of weights that are learned during pretraining and subsequent fine-tuning (see figure 1). Additionally, LLMs are not queried using a synthetic language such as SQL but instead directly by the user, who can engage with them via natural-language conversations.

[Figure 1: An LLM's knowledge encoded in billions of weights learned during pretraining and fine-tuning.]

The conceptual equivalent of a query engine inside an LLM is the next-token prediction function next ∈ token* → ℙ(token), which computes a probability distribution over possible next tokens next(q) ∈ ℙ(token), given an initial sequence of tokens q ∈ token*. It is exactly this probabilistic nature of the next function that enables diverse and creative responses, thus differentiating LLMs from traditional databases with their deterministic answers. In the context of this article, tokens can represent any atomic unit in a sequence of items ingested or generated by the model. These units range from indices into the model's vocabulary array to higher-level entities such as words and sentences or, more generally, entire user, assistant, or system messages drawn from chat conversations.

While the prediction of individual tokens using next forms the foundation of LLM inference, the full completion function LLM ∈ token* → ℙ(token*) extends this process to the probabilistic generation of multiple tokens, allowing a comprehensive response to a given prompt. The implementation of the LLM algorithm relies on various sampling (sometimes called decoding) heuristics such as greedy search, beam search, or random sampling to provide a more interesting answer than would result from just picking the highest-probability option. This sampling process can be controlled using hyper-parameters such as temperature, frequency and presence penalties, logit bias, top-p, and top-k. It is also possible to terminate completion early, based on a given set of stop tokens.3
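
To make the role of these hyper-parameters concrete, here is a minimal sketch (not any particular vendor's implementation) of how temperature, top-k, and top-p could shape the sampling of a single next token from a model's raw logits:

import math, random

def sample_next(logits, temperature=1.0, top_k=None, top_p=None):
    # logits: dict mapping candidate token -> raw score produced by the model.
    # Temperature scaling: lower values sharpen the distribution, higher values flatten it.
    scaled = {t: score / max(temperature, 1e-6) for t, score in logits.items()}
    # Softmax turns the scaled scores into probabilities.
    m = max(scaled.values())
    exp = {t: math.exp(s - m) for t, s in scaled.items()}
    z = sum(exp.values())
    probs = sorted(((t, e / z) for t, e in exp.items()), key=lambda p: -p[1])
    if top_k is not None:          # top-k: keep only the k most likely tokens
        probs = probs[:top_k]
    if top_p is not None:          # top-p (nucleus): keep the smallest prefix with mass >= p
        kept, mass = [], 0.0
        for t, p in probs:
            kept.append((t, p))
            mass += p
            if mass >= top_p:
                break
        probs = kept
    # Renormalize what is left and sample one token from it.
    z = sum(p for _, p in probs)
    return random.choices([t for t, _ in probs], weights=[p / z for _, p in probs])[0]

For example, sample_next({"sunny": 2.0, "cloudy": 1.0, "snow": -3.0}, temperature=0.7, top_k=2) will only ever return "sunny" or "cloudy", with "sunny" strongly favored; as the temperature approaches zero, the behavior approaches greedy search.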

If we abstract away from the model's ability to surface probabilities, we can represent next and LLM simply as nondeterministic functions next ∈ token* → {token} and LLM ∈ token* → {token*}, or equivalently as binary relations next ⊆ token* × token and LLM ⊆ token* × token*. Then we can describe their nondeterministic computational behavior using big-step semantics notation.17 In this case, there are two evaluation relations:

 

q ⇓ t

which indicates that t ∈ next(q) or (q, t) ∈ next (i.e., t is the next token generated by next given the initial token sequence q).

q ⇓* r

which indicates that r ∈ LLM(q) or (q, r) ∈ LLM (i.e., r is the answer generated by LLM for prompt q).

 

Using big-step semantics, we can concisely explain how LLM uses next to generate longer sequences of tokens as follows:

• Either the LLM function stops generating new tokens—for example, because the next predicted token would have been a stop token, because the output token limit has been reached, or simply because the generation is complete (in which case the remaining completion is the empty sequence ε):

──────────
 q ⇓* ε

• Or, reading from top to bottom, it auto-regressively generates new tokens from the previous sequence q extended with the token t predicted by next(q):

q ⇓ t    q, t ⇓* r
──────────────────
    q ⇓* t, r

These two formulas pack a lot of operational information in a small space, which is why we like big-step semantics.
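
As a sanity check, the two rules translate directly into a simple loop. Here is a minimal sketch in which next_fn stands for any implementation of next that returns a single sampled token; the stop tokens and the token limit are illustrative parameters:

def complete(next_fn, prompt, stop=("<eos>",), max_tokens=256):
    # prompt: a list of tokens; next_fn: token sequence -> one predicted token.
    q, answer = list(prompt), []
    for _ in range(max_tokens):   # first rule: stop when the output token limit is reached
        t = next_fn(q)            # q ⇓ t
        if t in stop:             # first rule: stop when a stop token is predicted
            break
        answer.append(t)          # second rule: the completion is t followed by the rest
        q.append(t)               # auto-regress on q extended with t
    return answer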

 

Dynamic Data Access Using Virtual Memory

No matter how many weights an LLM has, there will always be knowledge that is out of scope since it is not or cannot be stored within the weights:

• LLMs are trained on static datasets that contain data collected only up to a certain cutoff point in time. For example, if we ask the model

user: What is the current temperature in Palo Alto?

it will reply something along the lines of

assistant: I'm sorry, but I don't have access to realtime information or the ability to provide current weather updates,

thus explicitly acknowledging that knowledge of the current weather is not present in the model's weights. LLMs are not always so well-behaved and often hallucinate answers that sound plausible but are unfounded, rather than simply saying they don't know. This tendency to fabricate responses is a notable challenge when working with LLMs.

• For applications such as chatbots, we need to keep track of the conversation history (user; assistant) between the user and the model. Without keeping context, it is impossible to have multi-turn conversations, which is precisely what makes LLM-based chat interfaces compelling. Obviously, this is another kind of data that exists outside the temporal scope of the model's training data.

• Neither privacy-sensitive information nor data with access restrictions should be directly encoded in the model's weights since this would effectively make that data part of the model's public knowledge base and thus leak this information in an uncontrolled manner. Furthermore, access restrictions may change or require updates over time, which is at odds with the static nature of trained model weights. It is impractical to retrain the model every time its training data is chmod-ed.

Fortunately, instead of statically ingesting information in the model's weights during training, we can dynamically load additional information via the model's input context and ask the model to take that into account as follows:4

user

Answer the question based on the following information:

As of 7:30 am PST. Coastal Flood Advisory.
Today. 55°/39°. 2%. Thu 11 | Day. 55°. 2%.
NW 6 mph. Sunshine to start,
then a few afternoon clouds.

Question: What is the current temperature in Palo Alto?

Based on the provided context, the model is now capable of answering what the current weather is:

assistant

Based on the provided information, the current temperature in Palo Alto is 55°F (Fahrenheit) as of 7:30 am PST.

 

While providing dynamic information via the model's context works well,2 the context capacity of LLMs is fundamentally limited. As of early 2024, the largest models in widespread production had an input context of 128K to 200K tokens and an output context of 4K tokens, or else shared a total of 32K tokens across input and output. Even with 200K tokens, or roughly 500 printed pages, it is impossible in practice to incorporate all the necessary dynamic information within the context alone.

Comparing the context of an LLM to the RAM of a conventional computer (although it is not really random access since the model actually seems to pay more attention to the beginning and end of its input10), we can borrow the idea of virtual memory from traditional operating systems.

Virtual memory provides the illusion of an infinite memory by "paging" in and out the relevant information between secondary memory and RAM. The operating system analyzes memory-access patterns and predicts which pages are likely to be needed in the near future to decide which pages to swap in and out. We also can do exactly the same thing for finding the "chunks" to load into and evict from the context of an LLM to answer the next question. This is called RAG (retrieval-augmented generation).6 Figure 2 shows the use of RAG as virtual memory.

[Figure 2: RAG as virtual memory, paging relevant chunks between secondary storage and the model's context.]

A common technique for implementing RAG is to compute an embedding1 of the user query and use that to find the best-matching information in a vector database21 (used as secondary storage) to include in the context. A useful mental model for embeddings is semantics-sensitive hashing, which maps high-dimensional data such as text and images to a low-dimensional vector such that semantically similar items are mapped to vectors that are geometrically close in the embedding space.
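
Here is a hedged sketch of that retrieval step, assuming an embed function that maps text to a vector (for example, an embedding model behind an API) and a plain in-memory list of (text, vector) chunks standing in for the vector database:

import math

def cosine(u, v):
    # Geometric closeness in the embedding space.
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def retrieve(query, chunks, embed, k=3):
    # chunks: list of (text, vector) pairs embedded offline; embed: text -> vector.
    qv = embed(query)
    ranked = sorted(chunks, key=lambda chunk: cosine(qv, chunk[1]), reverse=True)
    return [text for text, _ in ranked[:k]]    # the "pages" to swap into the context

def augmented_prompt(query, chunks, embed):
    context = "\n".join(retrieve(query, chunks, embed))
    return ("Answer the question based on the following information:\n"
            + context + "\n\nQuestion: " + query)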

Dynamic Tool Invocation as Instruction Dispatch

Current weather is not a great example of data to fetch from secondary storage via RAG, as it changes too rapidly. To access realtime (or transient) information that requires actual computation instead of retrieval, the model should have the ability to use external "tools" (see figure 3). These tools can query a database or act as interfaces to external services or APIs, thereby allowing the model to ingest dynamic and context-specific information, with the actual arguments required to invoke these tools generated by the model itself.

[Figure 3: Dynamic tool invocation, with the model generating the arguments for the external tools it calls.]

If any of these tools are backed by another LLM, this is often referred to as a multi-agent system. This is a bit pretentious since we don't call traditional OO (object-oriented) programs that use multiple objects multi-object systems.

Building on the analogy of models as databases and the fact that LLM is a relation,7 we also model tools as relations since they could be implemented using models themselves, or in general be stochastic or nondeterministic. For example, we can consider the current weather in a given city as a binary relation, as shown in table 1.

[Table 1: The WEATHER relation between cities and their current weather.]

When reasoning about invoking a tool f, we use the big-step notation e ⇓f v to denote that v ∈ f(e), treating the tool f as a (nondeterministic) function, or equivalently (e, v) ∈ f, treating the tool as a relation. In both cases, we say that the tool invocation f(e) returned v. For example, based on table 1, the tool invocation WEATHER("Palo Alto") may return "Rainy and 40°F", which in big-step notation looks like this:

"Palo Alto" ⇓WEATHER "Rainy and 40°F"

Most language models that support the use of tools expose them as functions,15 but we believe that relations provide a more powerful, flexible, and intuitive framework than using functions.
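
One way to honor this relational view in code is to model a tool not as a function that returns exactly one value, but as a set of (argument, result) pairs that may relate the same argument to several results. A minimal sketch, with illustrative pairs echoing table 1:

# A tool as a relation: a set of (argument, result) pairs rather than a function.
WEATHER = {
    ("Palo Alto", "Rainy and 40°F"),
    ("Palo Alto", "Sunny and 80°F"),     # the same city may relate to several observations
    ("Seattle", "Overcast and 55°F"),
}

def invoke(relation, arg):
    # The nondeterministic "function view" of the relation: every result related to arg.
    return {result for (a, result) in relation if a == arg}

# invoke(WEATHER, "Palo Alto") returns both observations; a functional API would have to pick one.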

A popular technique for integrating tool invocation into LLM interactions is the ReAct (Reason+Act) approach,23 which, given a set of possible tools, nudges the model to cycle through a sequence of explanations and tool invocations to arrive at the final answer. Here is a possible relational ReAct-style interaction sequence where the user asks for the current weather and the model hallucinates (in gray) the tool invocation to compute the answer:

system
... prompt that defines the WEATHER tool

user
What is the current weather in Palo Alto?

assistant
To get the weather in Palo Alto, I'll use the weather tool [WEATHER("Palo Alto", "Rainy and 40°F")]. So, the weather in Palo Alto is wet and 40°F.

However, the reasoning engine that orchestrates the flow between the tools and the LLM intercepts the LLM generation at the closing bracket immediately after the tool call [WEATHER("Palo Alto", "Rainy and 40°F")], thus effectively ignoring the model's musings about the weather it doesn't actually know. Instead, we invoke the WEATHER tool and find "Sunny and 80°F" ∈ WEATHER("Palo Alto"). Subsequently, we ask the model to continue from where we interrupted it with the correct information (in bold):

assistant
I need to invoke the weather service to get the current weather in Palo Alto. [WEATHER("Palo Alto", "Sunny and 80°F")]

assistant
The current weather in Palo Alto is sunny and 80°F.

In the second turn, no tool invocation was necessary to generate the final answer to the user's question, and the model used the provided information from the external tool to correctly answer the question.
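
A hedged sketch of the reasoning engine behind this transcript: llm stands for any chat-completion callable that takes a list of messages plus stop sequences and returns the generated text, and tools maps tool names to ordinary callables. The engine stops generation at the closing bracket of a tool call, discards the hallucinated result, substitutes the real one, and resumes:

import re

# Matches a trailing tool call such as [WEATHER("Palo Alto", "Rainy and 40°F")
TOOL_CALL = re.compile(r'\[(\w+)\("([^"]*)",\s*"([^"]*)"\)\s*$')

def react(llm, tools, question, max_steps=8):
    messages = [{"role": "user", "content": question}]
    for _ in range(max_steps):
        out = llm(messages, stop=["]"])            # intercept at the closing bracket
        m = TOOL_CALL.search(out)
        if m is None:                              # no tool call: this is the final answer
            return out
        name, arg, _hallucinated = m.groups()      # ignore the model's hallucinated result
        value = tools[name](arg)                   # invoke the real tool instead
        repaired = out[:m.start()] + '[' + name + '("' + arg + '", "' + str(value) + '")]'
        messages.append({"role": "assistant", "content": repaired})  # ask the model to continue from here
    raise RuntimeError("too many tool calls")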

Big-step semantics of ReAct

ReAct-style reasoning—generating an answer to a user question by way of a sequence of tool calls—can be captured by the judgment

user: question; assistant: calls ⇊ answer

using big-step semantics as follows:

• Either the LLM completes the query without making a tool call, in which case we are done and can combine the previous tool invocations with the final answer as the result:

user: question; assistant: calls ⇓* answer      answer contains no tool call
────────────────────────────────────────────────
user: question; assistant: calls ⇊ (calls; answer)

• Or, reading from top to bottom, the LLM makes a tool call, as witnessed by generating assistant: [f(e, w)]. In that case, we ignore the model's hallucinated suggestion w and instead insert the correct computed value v ∈ f(e) into the list of tool invocations performed so far, assistant: (calls; [f(e, v)]), and recurse to compute the final result:

user: question; assistant: calls ⇓* [f(e, w)]      e ⇓f v      user: question; assistant: (calls; [f(e, v)]) ⇊ result
────────────────────────────────────────────────
user: question; assistant: calls ⇊ result

In both rules, an empty sequence of tool calls corresponds to an empty assistant: message, which is how a derivation starts.

Note how this second inference rule nicely illustrates the analogy of the LLM as a branch predictor. Given the query user: question and the tools executed thus far by assistant: calls, the LLM predicts the next tool invocation [f(e, w)]. We then actually invoke that "instruction" f(e) to compute the true value v, and continue executing with [f(e, v)].

The major difference between a conventional branch predictor and the neural one is that the latter completely hallucinates the program that it executes one instruction at a time, while a regular branch predictor predicts the next instruction to call in an already existing program.

 

Addressing unintended model interference

The root problem with the vanilla ReAct approach to adding external tools is that it gives the model full visibility into the values that flow between itself and the tool invocations, leading to potential reinterpretations and corrections of tool outputs.

If the model "thinks" based on the knowledge stored in its weights that a value returned from a tool contradicts its internal knowledge, it can—and will—throw away the value obtained from the external tool. Then the model will hypothesize a new value that is more consistent with its internal-world model and use that instead of the value returned by the tool call.

For example, if the tool invocation returns a temperature of 10000°F for the weather in Palo Alto, the model will assume that is too high and silently replace it with 100°F, which it believes to be more reasonable.

user
What is the current weather in Palo Alto?

assistant
To get the weather in Palo Alto, I'll use the weather tool [WEATHER("Palo Alto", "Sunny and 10000°F")]

assistant
So, the weather in Palo Alto is Sunny and 100°F.

 

The reason we added tool calls was to fix the model's inability to perform certain calculations or access realtime data. Obviously, when the model starts to interfere with the values passed to and from tools, it becomes impossible to build reliable systems on top of such a mechanism.

Maybe this problem can be solved by once again taking inspiration from traditional computer architecture. Instead of dealing with immediate values, a conventional CPU typically uses a register-to-register model for specifying instructions, where both operands and results reside in named registers. We can use the same approach: To prevent the model from interpreting the values it manipulates, we simply won't give it values anymore. Instead:

• Inside the external reasoning engine we maintain an environment where variables are bound to values (analogous to a register set) and the model is allowed to handle only names of variables when interacting with external tools.

• To achieve this, we need to prompt the model to hallucinate variable names when it invokes a tool: [WEATHER("Palo Alto", @weatherInCity)] instead of hallucinating values [WEATHER("Palo Alto", "Rainy and 20°F")] as it did previously.

• Moreover, since the answer computed by the model may now contain variables that the model does not know the values of, we need to add expressions such as [@weatherInCity] to the model's vocabulary so it can formulate the answer to the original question as an expression over these opaque variables. The reasoning engine, of course, does know the values of variables and so can show them to the user only when the model generates an expression.

• In rare cases where the model actually needs to see values to produce its prediction, we can expose those in a controlled fashion as special privileged tools.

Here is a transcript of an exchange of user and assistant messages between the client and the model using our new variable-passing approach to answer the question: What is the weather between Mountain View and Menlo Park?

As before, we start with a system message that describes the set of available tools but also instructs the model to provide a variable to bind the result of each tool call:

system
... description of available tools ...

In response to the user's implicit question about the current weather, the model again uses its internal knowledge to figure out that we are looking for the weather in Palo Alto, but it now invokes the tool with a hallucinated variable @weatherPaloAlto.

assistant
The city between Mountain View and Menlo Park is Palo Alto.
Let's find out the weather there.
[WEATHER("Palo Alto", @weatherPaloAlto)

 

As before, we stop generation at the closing bracket. But instead of ignoring the model's suggestion, this time we use the variable @weatherPaloAlto and bind it in the environment of the reasoning engine to the result of the tool call {@weatherPaloAlto = "Sunny and 80°F"}. We then resume generation without telling the model what the result of the call was by just closing the ] bracket:

assistant
The city between Mountain View and Menlo Park is Palo Alto.
Let's find out the weather there.
[WEATHER("Palo Alto", @weatherPaloAlto)]

 

We rely upon the model keeping track in its own mind that the variable @weatherPaloAlto contains the current weather in Palo Alto.

In the next round, we stop generation at the generated expression where the model informs us about the weather in Palo Alto via the variable @weatherPaloAlto that was bound in the previous step:

assistant
The current weather in Palo Alto is
[@weatherPaloAlto

 

We use the reasoning engine to compute the value of the generated expression @weatherPaloAlto, showing it to the user (the Sunny and 80°F in the next assistant turn) but not to the model, and then continue generation by closing the ] bracket:

assistant
The current weather in Palo Alto is
[@weatherPaloAlto] Sunny and 80°F

assistant
.

Finally, the model signals that the answer is complete by not generating any further tool calls.

 

Operational semantics

Here is an example of operational semantics, which for simplicity returns only the final modified environment and leaves out the user and assistant tags of the trace. These semantics clearly show how the reasoning engine threads the environment σ across calls to the tool-script interpreter, while hiding any values from the model:

• When the model responds without a tool call or expression, we are done and immediately return the current environment unchanged:

question; calls ⇓* answer      answer contains no tool call or expression
──────────────────────────────────────────
σ ⊢ question; calls ⇊ σ

• When the LLM suggests evaluating an expression e, we execute it in the current environment σ and show the resulting value v to the user, and then continue execution to compute σ′ without showing the value to the model:

question; calls ⇓* [e]      σ ⊢ e ⇓ v (shown to the user)      σ ⊢ question; (calls; [e]) ⇊ σ′
──────────────────────────────────────────
σ ⊢ question; calls ⇊ σ′

• When the model wants to use a tool f(e, @x), we invoke f(e) inside the reasoning engine so we can use the current environment to find a value v ∈ f(e), and bind that to @x to create the updated environment that is then recursively used to generate the rest of the computation:

question; calls ⇓* [f(e, @x)]      σ ⊢ e ⇓f v      σ[@x ↦ v] ⊢ question; (calls; [f(e, @x)]) ⇊ σ′
──────────────────────────────────────────
σ ⊢ question; calls ⇊ σ′


The environment σ is similar to the register set of a conventional CPU, which also binds (register) names to values. The main difference is that a conventional CPU has only a fixed number of registers; the similarity is that its instructions, such as ADD r3, r1, r2, likewise operate on register names rather than on immediate values.
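
A hedged sketch of this register-style engine, reusing the same assumed llm and tools shapes as in the earlier ReAct sketch: the environment σ is a plain dictionary from @-variable names to values, tool calls bind variables, and [@var] expressions are printed for the user but never echoed back to the model:

import re

CALL = re.compile(r'\[(\w+)\("([^"]*)",\s*(@\w+)\)\s*$')   # e.g. [WEATHER("Palo Alto", @weatherPaloAlto)
EXPR = re.compile(r'\[(@\w+)\s*$')                          # e.g. [@weatherPaloAlto

def run(llm, tools, question, max_steps=8):
    env = {}                                                # σ: variable name -> value
    messages = [{"role": "user", "content": question}]
    for _ in range(max_steps):
        out = llm(messages, stop=["]"])                     # intercept at the closing bracket
        if m := CALL.search(out):
            name, arg, var = m.groups()
            env[var] = tools[name](arg)                     # bind @var in σ; the model never sees the value
            messages.append({"role": "assistant", "content": out + "]"})
        elif m := EXPR.search(out):
            print(env[m.group(1)])                          # show the value to the user only
            messages.append({"role": "assistant", "content": out + "]"})
        else:
            return env                                      # done: return the final environment σ
    return env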

 

Neural Computers

Now that you have seen how to extend LLMs with tool usage, let's zoom out and reframe the toolset that the model has at its disposal as the instruction set of a kind of "neural computer."

In this analogy, the ReAct-style interception and dispatching of tool invocations (as captured by the ReAct big-step semantics shown earlier) mirrors the control unit of a conventional computer that fetches the next instruction, sends it to the ALU (arithmetic logic unit) for execution, passes operands, and receives results via registers. The LLM's role then is indeed akin to a branch predictor within a conventional CPU, which predicts the next most likely instruction to execute, as guided by the sequence of instructions executed so far.

These striking parallels between LLMs and conventional computers, summarized in table 2, strongly suggest that rather than viewing LLMs as glorified databases, we should treat them as the core component of a new kind of neural computer.

[Table 2: Parallels between LLMs and conventional computers.]

 

A Natural Language-based Reasoning Language

Having "invented" a new kind of computer, we naturally also need a novel programming language that can be used to instruct neural computers. For this, we choose a (restricted) form of logic programming embedded in natural-language comments—thereby closing the cycle between modern generative AI and old-school symbolic AI.9

Let's revisit the ReAct-style interaction sequence from before, only now assuming we have an abstract set of tools f, ... , g at our disposal. If we combine all partial assistant: responses as we did in the big-step semantics, then a ReAct-style interaction corresponds to a goal consisting of a sequence of predicates/tool calls interleaved with comments:

 

user
question

assistant
..., [f( ..., x, ...)], ..., [g( ..., y, ...)], ...

 

where x and y are terms over @-variables.

Each tool call is an interrupted hallucination by the LLM, providing a peek inside the model's "mind" regarding how it reasons when answering the given question by generating a chain of thoughts20 in the form of a sequence of tool invocations with explanations.

By externalizing this internal reasoning performed by the model in the form of a programming language—the mentalese5 of LLMs, so to speak—we can regard the assistant trace as the source code synthesized by the model from the natural-language question issued by the user.

By appealing to Tennent's principles of abstraction and parameterization,19 we can then give these goals a name and parameters h( ..., x, ...) and hence obtain rules of the form ...[h( ..., x, ... )]... :- ...[f( ..., y, ... )], ...[g( ..., z, ... )], ... embedded in natural-language explanations, where x, y, z are terms over @-variables. To escape code fragments inside natural-language explanations, we use the "hedges" [ and ]. And, voilà, we've (re-)discovered literate Prolog as the externalized reasoning language for chain of thought.

Here is a concrete example of how to write a tool in our proposed reasoning language. Say we want to implement the WEATHER(@city, @weather) tool using the National Weather Service forecast API.13 Accomplishing this means going through the following sequence of steps that invoke more primitive tools:

[WEATHER(@city, @weather)]
in [@city] using the National Weather Service API :-
To find the current weather in [@city],
we first need to find the coordinates of the city via [GEO_CODE(@city, @lat, @lon)].
Then, given its [@lat] and [@lon] coordinates,
we can invoke [WEATHER_GOV(@lat, @lon, {... "forecast": @url ...})]
to obtain the forecast url [@url],
using which we can finally fetch
[HTTP_GET(@url, {... "detailedForecast": @weather ...})] the current [@weather] for [@city].

 

Primitive tools such as GEO_CODE(@city, @lat, @lon) are facts that are implemented in a traditional (imperative) programming language.

With such a reasoning language, users not only can define new tools from scratch, but also take the model's previously generated code sequences and apply them as new tools to the model, thus creating a self-learning system. Moreover, given that the language is syntax for the model's internal reasoning, the user can also use it to provide few-shot examples to instruct the model on how to use tools.
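
A hedged sketch of what such a self-learning loop could look like: previously synthesized rule texts are kept in a registry and prepended to the system prompt, so the model can invoke them like any other tool. The prompt wording and registry shape here are assumptions, not a fixed API:

RULES = {}   # name -> literate-Prolog rule text previously synthesized by the model

def register_rule(name, rule_text):
    # Promote a generated reasoning trace into a reusable, named tool.
    RULES[name] = rule_text

def system_prompt(primitive_tools):
    lines = ["You can call the following tools, binding each result to an @variable:"]
    lines += ["  [" + signature + "]" for signature in primitive_tools]
    if RULES:
        lines.append("You may also use these previously learned rules as tools:")
        lines += ["  " + text for text in RULES.values()]
    return "\n".join(lines)

# Example: register_rule("WEATHER",
#     '[WEATHER(@city, @weather)] :- ... [GEO_CODE(@city, @lat, @lon)] ... [HTTP_GET(@url, ...)] ...')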

 

Safety Using AI-enabled Formal Methods

Ensuring that superhuman AI systems benefit humanity while mitigating unintended risks is a significant priority. One interesting approach to ensuring AI safety, borrowed from programming languages, is to use proof-carrying code.14,18 The basic idea is that the user defines (potentially with the help of a model) a security policy that the AI system must obey and then, when the model generates code, it must also provide a proof that the generated code satisfies the required safety policy.

Since our language consists of only Horn clauses (like a poor man's dependent type system), the same language used for writing programs can also be used to define security policies via invariants and pre- and post-conditions. By keeping the language simple, we lower the bar for models to prove properties about the code while also making the code more amenable to formal verification. Although proving nontrivial specifications correct is challenging, the anticipated 10- to 100-fold increase in model capabilities suggests that future AI could help users create specifications and generate code, together with its corresponding formal correctness proofs.

We believe this offers a pragmatic and extensible method for implementing AI safety that allows even nontechnical users to monitor and enforce the logical correctness and ethical compliance of all computations created by AI models. Powerful AI will finally make the dream of formal methods and correct-by-construction software feasible.

 

Tree of Thought

One topic that remains to be addressed is how to deal with errors during execution of code generated by the model. For example, what happens if the expression or statement generated by the LLM is invalid, throws an exception, or fails to verify against the specification? One common case where this happens is when the model generates the wrong arguments for a tool or does not convert values to the proper types, thus causing the tool invocation to crash.

The big-step operational semantics, however, specify only terminating and valid runs. When errors surface during execution, there is simply no derivation leading to a final result. For example, for every environment σ, there is no related environment σ′ that results from executing an ill-typed math expression such as

[@x is "Sunny and 80°F" + 10]

because there is no value v that results from adding a number to a string.

 

The absence of a derivation signals that the computation could not proceed normally because of errors, or nontermination.

When actually implementing big-step semantics as a reasoning engine that coordinates between the LLM and the script interpreter, we have to deal with errors, as we cannot rely upon an oracle to guard against missteps. The obvious algorithm to search for a valid derivation is to use backtracking over all possible derivations: whenever the execution of an expression or a tool invocation suggested by the model fails, we back up and try to generate an alternative.

Backtracking brings our language even closer to Prolog; the only thing missing is true logic variables. Even in pure Prolog, however, logic variables are somewhat of a gimmick.12 While it seems cute in theory to run predicates backwards—for example, to split a list in two using the same code that appends it—there is often a clear direction implied in the rules because of operational constraints on relations such as @X is @Y+@Z, which requires @Y and @Z to be values in order to perform the addition.

Since we are relying on a rich set of externally defined tools, these constraints become even more prevalent, even when rules are reversible. While we can imagine representing STT(@speech, @text) and TTS(@text, @speech) as a single reversible predicate, the practicality of running WEATHER(@city, @weather) in reverse to find a city given the weather is more questionable. Too many cities have the same weather.

The subset of Prolog that we implemented is traditionally called "fixed-mode": tools have clear inputs and outputs, all inputs need to be fully evaluated (that is, they contain no uninstantiated variables), and dependencies flow strictly from left to right. This means we do not need unification but only one-way pattern matching. While these restrictions could be lifted and true logic variables implemented, we have not yet found any real use cases for such a general model.
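
Here is a hedged sketch of that error-handling search, layered on the variable-passing engine shown earlier (same assumed llm and tools shapes): when a model-suggested tool call raises an error, the engine records the failure and asks the model for an alternative continuation rather than giving up. This is a bounded retry rather than a full search over all derivations:

import re

CALL = re.compile(r'\[(\w+)\("([^"]*)",\s*(@\w+)\)\s*$')    # same trailing-tool-call pattern as before

def run_with_backtracking(llm, tools, question, max_steps=8, retries=3):
    env, messages = {}, [{"role": "user", "content": question}]
    for _ in range(max_steps):
        for _attempt in range(retries):
            out = llm(messages, stop=["]"])
            m = CALL.search(out)
            if m is None:
                return env                     # no further tool calls: a valid derivation was found
            name, arg, var = m.groups()
            try:
                env[var] = tools[name](arg)    # fixed mode: inputs fully evaluated, left to right
                messages.append({"role": "assistant", "content": out + "]"})
                break                          # this step succeeded; predict the next one
            except Exception as err:
                # Back up: report the failure and ask the model to try something else.
                messages.append({"role": "user",
                                 "content": "The call " + name + "(" + arg + ") failed (" + str(err)
                                            + "); please try a different approach."})
        else:
            raise RuntimeError("no valid derivation found")
    return env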

 

Related Work

Given the rapid spread of generative AI, the concept of using LLMs in a manner akin to more traditional computational resources has been explored to varying extents. Andrej Karpathy's tweet8 about an LLM OS has generated a lot of follow-up blog posts and papers, such as one on the LLM Agent Operating System.11 The latter creates an abstract multi-agent operating system where various operating-system components, such as the scheduler and the memory-management system, are implemented using LLMs. Karpathy's original tweet pictures the LLM as a CPU. The MemGPT system16 adds virtual-memory capabilities to a "processor LLM," but it uses function calls to let the model itself decide how to move data between the context and external storage. We prefer to decouple the LLM processor from the mechanism that manages the overall context. This is actually more similar to how virtual memory works in traditional systems, where the operating system—not the CPU—manages virtual memory.

What is unique about our approach is the definition of a reasoning language to run on top of the LLM-based computer or operating system. This, of course, touches on the wider field of neuro-symbolic AI, which aims to combine neural networks with symbolic reasoning.

 

References

1. Bengio, Y., Ducharme, R., Vincent, P., Janvin, C. 2003. A neural probabilistic language model. The Journal of Machine Learning Research, 3, 1137–1155; https://dl.acm.org/doi/10.5555/944919.944966.

2. Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., Amodei, D. 2020. Language models are few-shot learners. Proceedings of the 34th International Conference on Neural Information Processing Systems. Article 159, 1877-1901; https://dl.acm.org/doi/abs/10.5555/3495724.3495883.

3. Huyen, C. 2024. Sampling for text generation; https://huyenchip.com/2024/01/16/sampling.html.

4. Dong, Q., Li, L., Dai, D., Zheng, C., Wu, Z., Chang, B., Sun, X., Xu, J., Li, L., Sui, Z. 2023. A survey on in-context learning. arXiv:2301.00234; https://arxiv.org/abs/2301.00234.

5. Fodor, J. A. 1975. The Language of Thought. New York, NY: Thomas Y. Crowell Co.

6. Gao, Y., Xiong, Y., Gao, X., Jia, K., Pan, J., Bi, Y., Dai, Y., Sun, J., Guo, Q., Wang, M., Wang, H. 2024. Retrieval-augmented generation for large language models: a survey. arXiv:2312.10997; https://arxiv.org/abs/2312.10997.

7. Hernandez, E., Sharma, A. S., Haklay, T., Meng, K., Wattenberg, M., Andreas, J., Belinkov, Y., Bau, D. 2024. Linearity of relation decoding in transformer language models. arXiv:2308.09124; https://arxiv.org/abs/2308.09124.

8. Karpathy, A. 2023. LLM OS. Bear with me I'm still cooking. X; https://x.com/karpathy/status/1723135784142979392.

9. Körner, P., Leuschel, M., Barbosa, J., Santos Costa, V., Dahl, V., Hermenegildo, M. V., Morales, J. F., Wielemaker, J., Diaz, D., Abreu, S., Ciatto, G. 2022. Fifty years of Prolog and beyond. arXiv:2201.10816; https://arxiv.org/abs/2201.10816.

10. Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., Liang, P. 2023. Lost in the middle: how language models use long contexts. arXiv:2307.03172; https://arxiv.org/abs/2307.03172.

11. Mei, K., Li, Z., Xu, S., Ye, R., Ge, Y., Zhang, Y. 2024. AIOS: LLM agent operating system. arXiv:2403.16971; https://arxiv.org/abs/2403.16971.

12. Mellish, C. 1985. Some global optimizations for a Prolog compiler. Journal of Logic Programming 2(1), 43–66; https://www.sciencedirect.com/science/article/pii/0743106685900044.

13. National Weather Service. API Web Service documentation; https://www.weather.gov/documentation/services-web-api.

14. Necula, G. C. 2000. Proof-carrying code (abstract): design, implementation and applications. In Proceedings of the 2nd ACM SIGPLAN International Conference on Principles and Practice of Declarative Programming, 175–177; https://dl.acm.org/doi/10.1145/351268.351289.

15. OpenAI. Function calling. OpenAI Platform documentation; https://platform.openai.com/docs/guides/function-calling.

16. Packer, C., Fang, V., Patil, S. G., Lin, K., Wooders, S., Gonzalez, J. E. 2023. MemGPT: towards LLMs as operating systems. arXiv:2310.08560; https://arxiv.org/abs/2310.08560.

17. Siek, J. 2012. Crash course on notation in programming language theory; https://siek.blogspot.com/2012/07/crash-course-on-notation-in-programming.html.

18. Tegmark, M., Omohundro, S. 2023. Provably safe systems: the only path to controllable AGI. arXiv:2309.01933; https://arxiv.org/abs/2309.01933.

19. Tennent, R. D. 1977. Language design methods based on semantic principles. Acta Informatica 8, 97–112; https://link.springer.com/article/10.1007/BF00289243.

20. Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q., Zhou, D. 2023. Chain-of-thought prompting elicits reasoning in large language models. arXiv:2201.11903; https://arxiv.org/abs/2201.11903.

21. Wikipedia. Vector database; https://en.wikipedia.org/wiki/Vector_database.

22. Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T. L., Cao, Y., Narasimhan, K. 2023. Tree of thoughts: deliberate problem solving with large language models. arXiv:2305.10601; https://arxiv.org/abs/2305.10601.

23. Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y. 2023. ReAct: synergizing reasoning and acting in language models. arXiv:2210.03629; https://arxiv.org/abs/2210.03629.

Erik Meijer brings a rare combination of technical expertise and people leadership to his latest quest to use AI to democratize end-user programming. As a renowned computer scientist, entrepreneur, and tech influencer, Meijer has made pioneering contributions to programming languages, compilers, cloud infrastructures, and AI throughout his tenures at Microsoft, Meta (Facebook), Utrecht University, and Delft University of Technology.

Copyright © 2024 held by owner/author. Publication rights licensed to ACM.


Originally published in Queue vol. 22, no. 3