Multitasking Programming in Linux


Processes in UNIX

In UNIX, the main means of organizing multitasking, and its basic unit, is the process. The operating system manipulates the process image: the program code plus the process data sections that define the execution environment.

During execution, or while waiting "in the wings," processes are held in page-organized virtual memory. Part of this virtual memory is mapped to physical memory, and part of physical memory is reserved for the operating system kernel; users can access only the memory left over for processes. When necessary, process memory pages are evicted from physical memory to the swap area on disk. When a virtual memory page that is not in physical memory is accessed, it is paged back in from disk.

Virtual memory is implemented and automatically maintained by the UNIX kernel.

Process types

There are three types of processes in UNIX operating systems: system processes, daemons, and application processes.

System processes are part of the kernel and are always resident in RAM. They have no corresponding programs in the form of executable files and are launched in a special way when the system kernel is initialized. Their executing instructions and data reside in the kernel, so they can call functions and access data that other processes cannot reach.

System processes include init, the initial-initialization process, which is the progenitor of all other processes. Although init is not part of the kernel and runs from an executable file, its operation is vital to the functioning of the entire system.

Daemons are non-interactive processes that are launched in the usual way, by loading their corresponding programs into memory, and are executed in the background. Typically, daemons are started during system initialization (but after the kernel is initialized) and provide the operation of various UNIX subsystems: terminal access, printing, network services, and so on. Daemons are not associated with any user. Most of the time, daemons wait for some process to request a particular service.



Application processes include all other processes running in the system. Typically, these are processes spawned within a user session. The most important user process is the login command interpreter (shell), which executes user commands on a UNIX system.

User processes can run in both interactive (foreground) and background modes. An interactive process has exclusive ownership of the terminal, and until it completes its execution the user has no access to the command line.

Process Attributes

A UNIX process has a number of attributes that allow the operating system to control its operation. Main attributes:

· Process ID (PID), which allows the system kernel to distinguish between processes. When a new process is created, the kernel assigns it the next free identifier (one not associated with any process). Identifiers are assigned in ascending order: the ID of a new process is greater than the ID of the process created before it. If the ID reaches the maximum value (usually 65535), the next process receives the smallest free PID and the cycle repeats. When a process exits, the kernel releases the identifier it was using.

· Parent process ID (PPID), the identifier of the process that spawned this one. All processes in the system, except system processes and init (the ancestor of all the rest), are spawned by an existing or previously existing process.

· Priority adjustment (NI), the relative priority of the process, taken into account by the scheduler when determining the startup order. The actual distribution of processor resources is determined by the execution priority (the PRI attribute), which depends on several factors, in particular on the given relative priority. The relative priority is not changed by the system during the life of the process, although the user or administrator can change it when starting the process with the nice command. The range of priority increments on most systems is -20 to 20. If no increment is specified, the default value of 10 is used. A positive increment means a decrease in priority: ordinary users can only set a positive increment and thus only lower the priority. The root user can set a negative increment, which raises the priority of the process and thereby helps it run faster. Unlike the relative priority, the execution priority of a process is changed dynamically by the scheduler.

· Terminal line (TTY), the terminal or pseudo-terminal associated with the process. This terminal is connected to the standard streams: standard input, standard output, and the standard error stream. Streams (pipes) are a standard means of interprocess communication in UNIX. Daemon processes are not associated with a terminal.

· Real (UID) and effective (EUID) user identifiers. The real user ID of a process is the ID of the user who started it. The effective ID is used to determine the process's access rights to system resources (primarily file system resources). Usually the real and effective IDs are the same, i.e. the process has the same rights in the system as the user who launched it. However, a process can be given more rights than the user has by setting the SUID bit, in which case the effective ID is set to the ID of the owner of the executable file (for example, root).

· Real (GID) and effective (EGID) group identifiers. The real group ID equals the primary or current group ID of the user who started the process. The effective ID is used to determine access rights to system resources on behalf of a group. Usually the effective group ID is the same as the real one, but if the SGID bit is set on an executable file, the file is executed with the effective group ID of its owner.
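These attributes are easy to observe from within a program. Below is a small illustrative sketch (mine, not part of the original text) that prints the main attributes using standard POSIX calls:

#include <stdio.h>
#include <unistd.h>
#include <sys/resource.h>

int main(void)
{
    /* Identifiers described above */
    printf("PID:  %d\n", (int) getpid());
    printf("PPID: %d\n", (int) getppid());
    printf("UID:  %d, EUID: %d\n", (int) getuid(), (int) geteuid());
    printf("GID:  %d, EGID: %d\n", (int) getgid(), (int) getegid());

    /* Relative priority (NI); 0 is the usual default for a process started from a shell */
    printf("NI:   %d\n", getpriority(PRIO_PROCESS, 0));

    /* Associated terminal (TTY), if any */
    printf("TTY:  %s\n", ttyname(STDIN_FILENO) ? ttyname(STDIN_FILENO) : "none");
    return 0;
}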

Original: Light-Weight Processes: Dissecting Linux Threads
Authors: Vishal Kanaujia, Chetan Giridhar
Date of publication: August 1, 2011
Translation: A. Panin
Translation publication date: October 22, 2012

This article, intended for Linux developers and computer science students, covers the basics of threads and their implementation using lightweight processes in Linux, with source code examples for better understanding.

Program threads are the basic element of a multitasking software environment. A thread can be described as an execution environment within a process; therefore, each process has at least one thread. Multithreading means a process has multiple execution environments that run in parallel (on multiprocessor systems) and are usually synchronized.

Program threads have their own identifiers (thread IDs) and can execute independently of each other. They share one process address space and use this as an advantage, avoiding IPC channels (interprocess communication mechanisms: shared memory, pipes, and other systems) for data exchange. Threads of a process can interact directly; for example, independent threads can read and change the value of a global variable. This communication model eliminates the overhead of kernel-level IPC calls. Because threads operate in a single address space, thread context switches are fast and inexpensive.

Each program thread can be processed individually by the task scheduler, so multithreaded applications are well suited for parallel execution on multiprocessor systems. In addition, the creation and destruction of threads is fast. Unlike a fork() call, a thread does not create a copy of the parent process's address space; instead, threads share the address space along with other resources, including file handles and signal handlers.

A multithreaded application uses resources optimally and as efficiently as possible. In such an application, program threads perform different tasks based on optimal use of the system. One thread can read a file from disk while another writes data to a socket. Both threads work in tandem while remaining independent of each other. This model optimizes system usage and thereby improves performance.

Several interesting features

The best-known aspect of working with threads is synchronization, especially when there is a shared resource accessed inside a critical section: the segment of code that accesses a shared resource and must not be executed by more than one thread at any time. Since each thread can execute independently, access to the shared resource is otherwise uncontrolled, which leads to the need for synchronization primitives, including mutexes (mutual exclusion), semaphores, read/write locks, and others.

These primitives let programmers control access to a shared resource. In addition, threads, like processes, are prone to ending up blocked or waiting forever if the synchronization model is not designed correctly. Debugging and analyzing multithreaded applications can also be quite cumbersome.
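Here is a small illustrative sketch (mine, not from the original article) of the typical pattern: two POSIX threads increment a shared global counter, and a mutex protects the critical section. Compile with gcc -pthread:

#include <stdio.h>
#include <pthread.h>

static long counter = 0;  /* the shared resource */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);    /* enter the critical section */
        counter++;
        pthread_mutex_unlock(&lock);  /* leave the critical section */
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    /* Without the mutex, lost updates would often leave this below 200000 */
    printf("counter = %ld\n", counter);
    return 0;
}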

How are threads implemented in Linux?

Linux allows you to develop and use multithreaded applications. At the user level, the implementation of threads in Linux follows the open POSIX standard (Portable Operating System Interface for uniX), designated IEEE 1003. The user-level library (glibc.so on Ubuntu) provides the implementation of the POSIX thread API.

In Linux, program threads exist in two separate spaces: user space and kernel space. In user space, threads are created with the POSIX-compliant pthread library API. These user-space threads are inextricably linked to kernel-space threads. In Linux, kernel-space threads are treated as "lightweight processes". A lightweight process is a unit of the basic runtime environment. Unlike various UNIX variants, including systems such as HP-UX and SunOS, Linux does not have a separate threading system: a process or thread in Linux is treated as a "task" and uses the same internal structures (a set of struct task_struct structures).

For a set of threads created in user space, there is a corresponding set of lightweight processes associated with them in the kernel. An example illustrates this point:

#include <stdio.h>
#include <pthread.h>
#include <unistd.h>
#include <sys/syscall.h>

int main()
{
    pthread_t tid = pthread_self();
    long sid = syscall(SYS_gettid);
    printf("LWP id is %ld\n", sid);
    printf("POSIX thread id is %lu\n", (unsigned long) tid);
    return 0;
}

Using the ps utility, you can get information about processes as well as the lightweight processes/threads of these processes:

kanaujia@ubuntu:~/Desktop$ ps -fL
UID      PID   PPID  LWP   C NLWP STIME TTY   TIME     CMD
kanaujia 17281 5191  17281 0 1    Jun11 pts/2 00:00:02 bash
kanaujia 22838 17281 22838 0 1    08:47 pts/2 00:00:00 ps -fL
kanaujia 17647 14111 17647 0 2    00:06 pts/0 00:00:00 vi clone.s

What are lightweight processes?

A lightweight process is a process that backs a user-space thread. Every user-space thread is inextricably linked to a lightweight process. The procedure for creating a lightweight process differs from that for a regular process: a user process "P" may have a number of associated lightweight processes with the same group ID. Grouping allows the kernel to share resources among them (the address space, physical memory pages, signal handlers, and file descriptors). It also lets the kernel avoid context switches when dealing with these processes. This extensive resource sharing is why these processes are called lightweight.

How does Linux create lightweight processes?

In Linux, lightweight processes are created with the non-standardized clone() system call. It is similar to fork(), but more general: in fact, a fork() call is typically implemented via clone() with additional parameters indicating which resources will be shared between the processes. A clone() call creates a process whose child shares elements of the runtime environment with its parent, including memory, file handles, and signal handlers. The pthread library also uses the clone() call to implement threads; see the source file ./nptl/sysdeps/pthread/createthread.c in the glibc source directory, version 2.11.2.

Creating your own lightweight process

Let's demonstrate an example of using the clone() call. Take a look at the source code from the demo.c file below:

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <unistd.h>
#include <fcntl.h>
#include <sched.h>
#include <signal.h>
#include <sys/wait.h>

// stack size 64kB
#define STACK (1024 * 64)

// The child thread will execute this function
int threadFunction(void* argument)
{
    printf("child thread entering\n");
    close((int)(intptr_t)argument);  /* close() takes an int, so cast back */
    printf("child thread exiting\n");
    return 0;
}

int main()
{
    void* stack;
    pid_t pid;
    int fd;

    fd = open("/dev/null", O_RDWR);
    if (fd < 0) {
        perror("/dev/null");
        exit(1);
    }

    // Reserve memory for the child's stack
    stack = malloc(STACK);
    if (stack == 0) {
        perror("malloc: could not allocate stack");
        exit(1);
    }

    printf("Creating child thread\n");

    // Call clone() to create the child thread
    pid = clone(&threadFunction, (char*) stack + STACK,
                SIGCHLD | CLONE_FS | CLONE_FILES | CLONE_SIGHAND | CLONE_VM,
                (void*)(intptr_t)fd);
    if (pid == -1) {
        perror("clone");
        exit(2);
    }

    // Wait for the child thread to finish
    pid = waitpid(pid, 0, 0);
    if (pid == -1) {
        perror("waitpid");
        exit(3);
    }

    // An attempt to write to the file will fail, because the thread
    // closed the file
    if (write(fd, "c", 1) < 0) {
        printf("Parent:\t child closed our file descriptor\n");
    }

    // Free the memory used for the stack
    free(stack);
    return 0;
}

The demo.c program creates threads in essentially the same way as the pthread library does. However, calling clone() directly is not advisable, because incorrect use can easily crash the application being developed. The syntax of the clone() function in Linux is:

#include <sched.h>

int clone(int (*fn)(void *), void *child_stack, int flags, void *arg);

The first argument is the thread function; it is called when the thread is started. After the clone() call completes successfully, the fn function begins executing concurrently with the calling process.

The next argument is a pointer to the memory that will serve as the child process's stack. Unlike fork(), clone() requires the programmer to reserve memory and pass in a pointer to be used as the child's stack, since parent and child share memory pages, and those include the stack. A child process may call a different function than the parent, which is why it needs a separate stack. In our program, we reserve this piece of memory on the heap with malloc(); the stack size is set to 64 KB. Since the x86 stack grows downwards, we must simulate that behavior and use the allocated memory from its far end. For this reason, we pass the following address to clone(): (char*) stack + STACK

The next flags argument is especially important. It allows you to specify which resources should be shared with the created process. We chose the bitmask SIGCHLD | CLONE_FS | CLONE_FILES | CLONE_SIGHAND | CLONE_VM described below:

  • SIGCHLD: the thread sends the SIGCHLD signal to the parent process on completion. Setting this flag lets the parent process use the wait() function to wait for all threads to finish.
  • CLONE_FS: share file system information between the parent process and the thread, including the filesystem root, the working directory, and the umask value.
  • CLONE_FILES: share the file descriptor table between the parent process and the thread. Changes to the table are visible to the parent process and all threads.
  • CLONE_SIGHAND: share the signal handler table between the parent and child processes. Again, if the parent process or one of the threads changes a signal handler, the change is visible to the others.
  • CLONE_VM: the parent process and threads run in the same memory space. Any memory writes or mappings made by one of them are visible to the other processes.

The last parameter is the argument passed to the function (threadFunction), in our case a file descriptor.

Refer back to the demo.c example of working with lightweight processes given above.

The thread closes the file (/dev/null) opened by the parent process. Because the parent process and thread share the file descriptor table, the file closing operation affects the parent process as well, causing subsequent write() calls to fail. The parent process waits for the thread to terminate (the moment it receives the SIGCHLD signal). After that, it frees the reserved memory and returns control.

Compile and run the program as usual; the output should be similar to this:

$ gcc demo.c
$ ./a.out
Creating child thread
child thread entering
child thread exiting
Parent:  child closed our file descriptor
$

Linux provides support for an efficient, simple, and scalable threading infrastructure. This has stimulated programmers' interest in experimenting with and developing threading libraries that use clone() as a core function.

We continue the topic of multithreading in the Linux kernel. Last time I talked about interrupts, their handling, and tasklets, and since this was originally intended to be a single article, in my story about workqueues I will refer to tasklets, assuming the reader is already familiar with them.
As last time, I will try to make my story as detailed and understandable as possible.

Articles in the series:

  1. Multitasking in the Linux kernel: workqueue

Workqueue

Workqueues are more complex and heavyweight entities than tasklets. I will not even try to describe all the intricacies of the implementation here, but I hope to analyze the most important things in reasonable detail.
Workqueues, like tasklets, serve for deferred interrupt processing (although they can be used for other purposes as well), but, unlike tasklets, they execute in the context of a kernel process; accordingly, they do not have to be atomic and may sleep, use various synchronization tools, and so on.

Let's first understand how workqueue processing is organized in general. The picture shows it very approximately and in simplified form; how everything actually happens is described in detail below.

Several entities are involved in this business.
First, the work item (work for short) is a structure describing the function (for example, an interrupt handler) that we want to schedule; it can be thought of as an analogue of the tasklet structure. Tasklets were scheduled on queues hidden from the user, but now we need to use a special queue, the workqueue.
Tasklets are picked up by scheduler functions, while a workqueue is processed by special threads called workers.
Workers provide asynchronous execution of works from the workqueue. Although they invoke works one after another, in the general case there is no strict, sequential execution: after all, preemption, sleeping, waiting, and so on all happen here.

In general, workers are kernel threads, that is, they are managed by the main Linux kernel scheduler. But workers partially intervene in scheduling to additionally organize parallel execution of works. This is discussed in more detail below.

To outline the main capabilities of the workqueue mechanism, I suggest exploring the API.

About the queue and its creation

alloc_workqueue(fmt, flags, max_active, args...)
The fmt and args parameters are a printf format for the queue name and its arguments. The max_active parameter limits the number of works from this queue that can be executed in parallel on one CPU.
A queue can be created with the following flags:
  • WQ_HIGHPRI
  • WQ_UNBOUND
  • WQ_CPU_INTENSIVE
  • WQ_FREEZABLE
  • WQ_MEM_RECLAIM
Particular attention should be paid to the WQ_UNBOUND flag. Based on this flag, queues are divided into bound and unbound.
In bound queues, works are bound to the current CPU when added, that is, they are executed on the core that scheduled them. In this respect, bound queues resemble tasklets.
In unbound queues, works can be executed on any core.

An important feature of the workqueue implementation in the Linux kernel is the additional organization of parallel execution present in bound queues. It is described in more detail below; for now, suffice it to say that it is done so as to use as little memory as possible while keeping the processor from standing idle. All of this is implemented on the assumption that one work does not use too many processor cycles.
This is not the case for unbound queues. In essence, such queues simply provide context to workers and start them as early as possible.
Thus, unbound queues should be used when a CPU-intensive workload is expected, since in that case the scheduler will take care of parallel execution across multiple cores.

By analogy with tasklets, works can be given an execution priority, normal or high. The priority is common to the entire queue: by default a queue has normal priority, and if the WQ_HIGHPRI flag is set, high priority.

The WQ_CPU_INTENSIVE flag only makes sense for bound queues. It opts the queue out of the additional organization of parallel execution. Use this flag when the works are expected to consume a lot of CPU time; in that case it is better to shift responsibility to the scheduler. This is described in more detail below.

The WQ_FREEZABLE and WQ_MEM_RECLAIM flags are specialized and beyond the scope of this topic, so we will not dwell on them in detail.

Sometimes it makes sense not to create your own queues, but to use common ones. The main ones:

  • system_wq - bound queue for fast work
  • system_long_wq - a bound queue for works that are expected to take a long time to execute
  • system_unbound_wq - unbound queue
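As an illustration, here is a minimal sketch of a kernel module (my own example, not from the original text) that declares a work and queues it on the shared system_wq; the macros and functions it uses are described in the next section, and details vary between kernel versions:

#include <linux/module.h>
#include <linux/workqueue.h>

static void my_work_func(struct work_struct *work)
{
    pr_info("my_work executed in process context\n");
}

/* Compile-time declaration of a work item */
static DECLARE_WORK(my_work, my_work_func);

static int __init my_init(void)
{
    /* Put the work on the shared bound queue for fast works */
    queue_work(system_wq, &my_work);
    return 0;
}

static void __exit my_exit(void)
{
    /* Make sure the work has finished before unloading */
    cancel_work_sync(&my_work);
}

module_init(my_init);
module_exit(my_exit);
MODULE_LICENSE("GPL");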

About works and their scheduling

Now let's deal with the works. First, let's look at the initialization, declaration, and preparation macros:
DECLARE(_DELAYED)_WORK(name, void (*function)(struct work_struct *work)); /* at compile time */
INIT(_DELAYED)_WORK(_work, _func); /* at run time */
PREPARE(_DELAYED)_WORK(_work, _func); /* to change the function being executed */
Works are added to the queue using the functions:
bool queue_work(struct workqueue_struct *wq, struct work_struct *work);
bool queue_delayed_work(struct workqueue_struct *wq, struct delayed_work *dwork, unsigned long delay); /* the work is added to the queue only after the delay expires */
This is worth dwelling on in more detail. Although we specify a queue as a parameter, works are not actually placed in the workqueue itself, as it might seem, but in a completely different entity: the queued-works list of the worker_pool structure. The worker_pool structure is, in fact, the most important entity in the workqueue mechanism, although for the user it stays behind the scenes. It is these pools that workers work with, and it is in them that all the basic information is kept.

Now let's see what pools exist in the system.
To begin with, pools for bound queues (shown in the picture). For each CPU, two worker pools are statically allocated: one for high-priority works, the other for works with normal priority. That is, if we have four cores, there will be only eight bound pools, no matter how many workqueues exist.
When we create a workqueue, a service pool_workqueue (pwq) is allocated for it on each CPU. Each such pool_workqueue is associated with the worker pool that is allocated on the same CPU and matches the queue's priority. Through them, the workqueue interacts with the worker pools.
Workers execute works from the worker pool indiscriminately, without distinguishing which workqueue the works originally belonged to.

For unbound queues, worker pools are allocated dynamically. All queues can be divided into equivalence classes by their parameters, and a worker pool is created for each class. The pools are accessed through a special hash table, where the key is a set of parameters and the value is the worker pool.
In fact, for unbound queues things are a little more complicated: whereas for bound queues pwqs and pools were created per CPU, here they are created per NUMA node, but that is an additional optimization we will not consider in detail.

All sorts of little things

I’ll also give a few functions from the API to complete the picture, but I won’t talk about them in detail:
/* Forced completion */
bool flush_work(struct work_struct *work);
bool flush_delayed_work(struct delayed_work *dwork);

/* Cancel execution of a work */
bool cancel_work_sync(struct work_struct *work);
bool cancel_delayed_work(struct delayed_work *dwork);
bool cancel_delayed_work_sync(struct delayed_work *dwork);

/* Delete a queue */
void destroy_workqueue(struct workqueue_struct *wq);

How workers do their job

Now that we are familiar with the API, let's try to understand in more detail how it all works and is managed.
Each pool has a set of workers handling its tasks, and the number of workers changes dynamically, adapting to the current situation.
As we have already found out, workers are threads that execute works in kernel context. A worker retrieves works in order, one after another, from the worker pool associated with it; as we already know, the works may belong to different source queues.

Workers can be in one of three logical states: idle, running, or managing.
A worker can stand idle and do nothing, for example when all the works are already being executed. When a worker enters this state, it goes to sleep and, accordingly, will not execute anything until it is woken up.
If no pool management is required and the list of scheduled works is not empty, the worker executes them. We will conventionally call such workers running.
When necessary, a worker takes on the role of pool manager. A pool can have at most one managing worker, or none at all. Its task is to maintain the optimal number of workers in the pool. How does it do this? First, it removes workers that have been idle for a long time. Second, it creates new workers if three conditions are met at once:

  • there are still tasks to complete (works in the pool)
  • no idle workers
  • there are no working workers (that is, active and not sleeping)
However, the last condition has its nuances. If the pool's queues are unbound, running workers are not counted: for them this condition is always true. The same applies to a worker executing a work from a bound queue that has the WQ_CPU_INTENSIVE flag. Note that for bound queues, since workers handle works from the common pool (one of the two per core in the picture above), some of those works end up counted as running and some do not. It also follows that works from WQ_CPU_INTENSIVE queues may not start executing immediately, but they themselves do not interfere with the execution of other works. It should now be clear why the flag is named as it is, and why it is used when we expect works to take a long time to execute.

Running workers are tracked directly in the main Linux kernel scheduler. This control mechanism maintains an optimal concurrency level, preventing the workqueue from creating too many workers while also not making works wait unnecessarily long.

Those who are interested can look at the worker function in the kernel; it is called worker_thread().

All the functions and structures described can be found in more detail in the files include/linux/workqueue.h, kernel/workqueue.c, and kernel/workqueue_internal.h. Documentation on workqueues is also available in Documentation/workqueue.txt.

It is also worth noting that the workqueue mechanism is used in the kernel not only for deferred interrupt processing (although this is a fairly common scenario).

Thus, we have looked at the mechanisms for deferred interrupt handling in the Linux kernel, tasklets and workqueues, which are a special form of multitasking. You can read about interrupts, tasklets, and workqueues in the book "Linux Device Drivers" by Jonathan Corbet, Greg Kroah-Hartman, and Alessandro Rubini, although the information there is sometimes outdated.



Naturally, I had only 10 minutes, so the presentation is a whirlwind tour: much is simplified, much is left out.

A little history

A relatively detailed history of the creation of the Linux kernel can be found in Linus Torvalds' famous book "Just for Fun". We are interested in the following facts:

    The kernel was created in 1991 by Linus Torvalds, a student at the University of Helsinki;

    As a platform, he used the Minix OS, written by Andrew Tanenbaum, running on a personal computer with an Intel 80386 processor;

    He used the Unix family of operating systems as a model, with first the POSIX standard as a guide, and later simply the source code of programs from the GNU project (bash, gcc, etc.).

These facts largely determined the future development of the kernel; their consequences are still noticeable in the modern kernel.

In particular, it is known that Unix systems at one time split into two camps: the descendants of UNIX System V Release 4 (the SVR4 family) versus the descendants of the Berkeley Software Distribution 4.2 (BSD 4.2). Linux largely belongs to the first family, but borrows some significant ideas from the second.

The kernel in numbers

  • About 30 thousand files
  • About 8 million lines of code (not including comments)
  • The repository takes up about 1 GB
  • linux-2.6.33.tar.bz2: 63 MB
  • patch-2.6.33.bz2: 10 MB, about 1.7 million changed lines
  • About 6000 people whose code is in the kernel

About the kernel architecture

All (or almost all) processors that have ever interested the makers of Unix-like operating systems have hardware support for privilege separation. Code at one level can do everything (including communicate directly with the hardware); code at the other can do almost nothing. Traditionally one speaks of "kernel mode" (kernel land) and "user mode" (user land). Kernel architectures differ primarily in their approach to the question: which parts of the OS code should execute in kernel land and which in user land? The point is that for the vast majority of processors, switching between the two modes takes significant time. The following approaches are distinguished:

    Traditional: a monolithic kernel. All kernel code is compiled into one large binary file, and the entire kernel runs in kernel mode;

    Opposite, innovative: microkernel. In kernel mode, only the most necessary parts are executed, everything else is executed in user mode;

    In the traditional approach, a variant appeared later: the modular kernel. Everything runs in kernel mode, but the kernel is compiled as one large binary file plus a set of small modules that can be loaded and unloaded as needed;

    And, of course, all kinds of hybrid architectures.

The Linux kernel started out as monolithic (following the Unixes that existed at the time). The modern Linux kernel is modular. Compared to a microkernel, a monolithic (or modular) kernel delivers significantly higher performance, but imposes significantly stricter quality requirements on the code of its components. Thus, in a microkernel system a "crashed" FS driver is simply restarted without affecting the operation of the system, while a crashed FS driver in a monolithic kernel means a kernel panic and a system halt.

Linux kernel subsystems

There is a fairly well-known diagram depicting the main subsystems of the Linux kernel and how they interact. In fact, about all it makes clear is that there are many parts and that their relationships are very complex. Therefore, we will consider a simplified diagram:

System calls

The system call layer is the part of the Linux kernel closest to the application programmer. System calls provide the interface used by application programs: the kernel API. Most Linux system calls are taken from the POSIX standard, but there are also Linux-specific ones.

It is worth noting here a difference in the approach to kernel API design between Unix systems, on the one hand, and Windows and other ideological descendants of VMS, on the other. Unix designers prefer to provide ten system calls with one parameter each instead of one system call with twenty parameters. The classic example is process creation. In Windows, the process-creation function, CreateProcess(), takes 10 arguments, 5 of which are structures. In contrast, Unix systems provide two system calls, fork() and exec(): the first has no parameters at all, the second has three.
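A minimal sketch of this two-call idiom (my illustration): fork() takes no parameters, and execve() takes exactly three:

#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void)
{
    pid_t pid = fork();  /* no parameters at all */
    if (pid == 0) {
        /* Child: replace the process image with /bin/ls */
        char *argv[] = { "ls", "-l", NULL };
        char *envp[] = { NULL };
        execve("/bin/ls", argv, envp);  /* three parameters */
        perror("execve");               /* reached only on failure */
        return 1;
    }
    waitpid(pid, NULL, 0);              /* parent waits for the child */
    return 0;
}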

System calls, in turn, access functions of lower-level kernel subsystems.

Memory management

The Linux kernel uses the page as its minimum unit of memory. Page size can vary with the hardware; on x86 it is 4 KB. To store information about a physical memory page (its physical address, ownership, mode of use, etc.), a special page structure 40 bytes in size is used.

The kernel uses the features of modern processors to organize virtual memory. Through manipulation of the virtual memory page directories, each process receives a 4 GB address space (on 32-bit architectures). Part of this space is available to the process only for reading or execution: kernel interfaces are mapped there.

Significantly, a process running in user space in most cases does not "know" where its data is located: in RAM or in the swap area. A process can ask the system to allocate memory for it specifically in RAM, but the system is not obligated to grant such a request.
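For example, the POSIX mlock() call asks the kernel to keep a range of pages resident in RAM; the request can be refused, say, because of resource limits. A small sketch (my illustration):

#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 4096;  /* one page on x86 */
    char *buf = malloc(len);
    if (!buf)
        return 1;
    /* Ask that these pages stay in RAM and never go to swap */
    if (mlock(buf, len) != 0)
        perror("mlock");  /* the system may refuse the request */
    else
        munlock(buf, len);
    free(buf);
    return 0;
}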

Process management

The Linux kernel has been multitasking literally from day one. By now it has quite good support for preemptive multitasking.

There have been two types of multitasking known in history:

Cooperative multitasking.

In this variant, each process transfers control to some other process when it sees fit. This saves time on switching processor modes, but obviously the reliability of such a system is out of the question: a hung process will never pass control to anyone. This variant is not used in modern operating systems.

Preemptive multitasking.

The OS kernel allocates a certain quantum of processor time to each process and "forcibly" transfers control to another process when this quantum expires. This creates overhead for switching processor modes and calculating priorities, but improves reliability and performance.

Process switching in Linux can occur on one of two events: a hardware interrupt or a timer interrupt. The timer interrupt frequency is set at kernel compile time, in the range of 100 Hz to 1000 Hz. Hardware interrupts happen practically all the time: just move the mouse or press a key on the keyboard, and the computer's internal devices generate interrupts. Since version 2.6.23 it has been possible to build a kernel that does not use timer-based process switching; this reduces power consumption when the computer is idle.

The process scheduler uses a rather complex algorithm based on calculating process priorities. Among processes, some require a lot of CPU time, while others spend more time on I/O. Based on this information, process priorities are regularly recalculated. User-specified nice values for individual processes are also taken into account.

In addition to user-mode multitasking, the Linux kernel uses kernel-mode multitasking: the kernel itself is multithreaded.

Traditional Unix kernels had the following, well, if not a problem, then a peculiarity: kernel code itself was not preemptible. An example: the /usr/bin/cat process wants to open the file /media/cdrom/file.txt and uses the open() system call to do so. Control passes to the kernel. The kernel discovers that the file is on a CD and begins initializing the drive (spinning up the disc, and so on). This takes significant time. All this time, control is not given to user processes, because the scheduler is not active while kernel code is running: all user processes wait for this open() call to complete.

In contrast, the modern Linux kernel is fully preemptible. The scheduler is disabled only for short intervals when the kernel cannot be interrupted at all, for example while initializing certain devices that require actions to be performed with fixed delays. At any other time, a kernel thread can be preempted and control transferred to another kernel thread or to a user process.

Network subsystem

In theory, almost the entire network subsystem of an OS kernel could run in user space: no privileges are needed for operations such as forming TCP/IP packets. However, in modern operating systems, especially those used on heavily loaded servers, maximum performance is required from the whole network stack, the entire chain from packet formation to working directly with the network adapter. A network subsystem running in user space would have to call into the kernel constantly to communicate with the network hardware, and that would incur very significant overhead.

The Linux network subsystem provides the following functionality:

    Socket abstraction;

    Network protocol stacks (TCP/IP, UDP/IP, IPX/SPX, AppleTalk, etc.);

    Routing;

    Packet filter (Netfilter module);

    Abstraction of network interfaces.

Different Unix systems have used two different application interfaces for access to network subsystem functionality: the Transport Layer Interface (TLI) from SVR4 and sockets from BSD. The TLI interface is, on the one hand, closely tied to the STREAMS subsystem, which is absent from the Linux kernel, and, on the other hand, incompatible with the socket interface. Therefore, Linux uses the socket interface taken from the BSD family.
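A minimal sketch of the BSD socket interface in action (my illustration, not from the original talk): create a TCP socket and try to connect it to a local web server:

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <sys/socket.h>

int main(void)
{
    /* The socket abstraction: address family, type, protocol */
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) {
        perror("socket");
        return 1;
    }

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(80);
    inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);

    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        perror("connect");  /* fails unless something is listening on port 80 */

    close(fd);
    return 0;
}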

File system

Virtual File System (VFS)

From an application's perspective, a Unix-like OS has only one file system: a directory tree growing from the "root". In most cases, applications do not care what media the file data is on; it may be on a hard drive, an optical disc, a flash drive, or even on another computer on another continent. This abstraction, and the subsystem that implements it, is called the virtual file system (VFS).

It is worth noting that VFS in the Linux kernel is implemented with ideas from OOP in mind. For example, the kernel maintains a set of inode structures, each of which contains (among other things):

    Data from on-disk inode (access rights, file size, etc.);

    Pointer to a structure describing the FS driver to which this inode belongs;

    A pointer to the structure of operations with inode, which, in turn, contains pointers to functions for creating an inode, changing its attributes, etc., implemented in a specific FS driver.

The kernel structures describing the other FS entities (superblock, directory entry, file) are arranged similarly.
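Schematically, this "OOP in C" pattern looks like the sketch below. The names are hypothetical and simplified (the kernel's real definitions live in include/linux/fs.h); the point is the table of function pointers that each FS driver fills in with its own implementations:

/* A simplified sketch of the idea, not the kernel's exact definitions */
struct my_inode;

struct my_inode_operations {
    int (*create)(struct my_inode *dir, const char *name, int mode);
    int (*setattr)(struct my_inode *inode, int attrs);
};

struct my_inode {
    unsigned long i_ino;  /* data from the on-disk inode */
    int i_mode;
    long i_size;
    const struct my_inode_operations *i_op;  /* pointer to the "methods" */
};

/* A specific FS driver supplies its own implementations... */
static int myfs_create(struct my_inode *dir, const char *name, int mode)
{
    /* ...allocate an on-disk inode, add a directory entry, and so on */
    return 0;
}

/* ...and publishes them in its operations table */
static const struct my_inode_operations myfs_inode_ops = {
    .create = myfs_create,
};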

FS Drivers

FS drivers, as can be seen from the diagram, sit at a much higher level than device drivers. This is because FS drivers do not communicate with any devices: a file system driver only implements the functions exposed to it through the VFS interface. Data is written to and read from memory pages; which of those pages get written to the media, and when, is decided at a lower level. The fact that FS drivers in Linux do not touch the hardware made it possible to implement the special FUSE driver, which delegates FS-driver functionality to modules running in user space.

Page cache

This kernel subsystem operates on virtual memory pages organized into a radix tree. When data is read from media, it is read into a page allocated in the cache; the page stays in the cache, and the FS driver reads data from it. The FS driver likewise writes data to memory pages located in the cache, and those pages are marked "dirty". A special kernel thread, pdflush, regularly walks the cache and generates write requests for dirty pages. Once a dirty page has been written to the media, it is marked clean again.

Block I/O level

This kernel subsystem operates with queues of bio structures. Each such structure describes a single I/O operation (roughly speaking, a request like "write this data to blocks 141-142 of device /dev/hda1"). Each process performing I/O has its own queue, and from this set of queues one request queue is formed for each device driver.

I/O Scheduler

If disk I/O requests from applications were executed in the order they arrive, average system performance would be very poor, because seeking to the desired sector of a hard drive is a very slow operation. Therefore, the scheduler processes the request queues by performing two operations:

    Sorting: the scheduler tries to place requests in a row that access nearby disk sectors;

    Merging: if sorting leaves several requests side by side that access consecutively located sectors, they should be combined into a single request.

Several schedulers are available in the modern kernel: Anticipatory, Deadline, CFQ, noop. There is also a version of the kernel from Con Kolivas with another scheduler, BFQ. Schedulers can be selected when the kernel is compiled or when it is launched.

The noop scheduler deserves special mention. It neither sorts nor merges requests, but forwards them to device drivers in the order they are received. On systems with ordinary hard drives, this scheduler performs very poorly. However, systems that use flash media instead of hard drives are now becoming common. For such media the sector seek time is zero, so sorting and merging are unnecessary: with the noop scheduler, performance does not change, and resource consumption drops slightly.

Interrupt handling

Almost all current hardware architectures use the concept of interrupts for devices to communicate with software. It works like this. Some process is running on the processor (it doesn't matter whether it's a kernel thread or a user process). An interrupt arrives from a device. The processor sets its current work aside and transfers control to the memory address associated with that interrupt number in a special interrupt table. The interrupt handler resides at that address. The handler performs some actions depending on what exactly happened with the device, then control passes to the process scheduler, which in turn decides who gets control next.

There is a certain subtlety here: while an interrupt handler is running, the task scheduler is not active. This is not surprising: the interrupt handler is supposed to work directly with the device, and the device may require actions to be performed within a strict time frame. So if an interrupt handler runs for a long time, all other processes and kernel threads will be kept waiting, which is usually unacceptable.

In the Linux kernel, as a result of any hardware interrupt, control is transferred to the do_IRQ() function. This function uses a separate table of kernel-registered interrupt handlers to determine where to pass control next.

To keep the time spent in interrupt context minimal, the Linux kernel divides handlers into top and bottom halves. The top half is the function the device driver registers as the handler for a specific interrupt; it does only the work that absolutely must be done immediately. It then registers another function (its bottom half) and returns. The task scheduler transfers control to the registered bottom half as soon as possible; in most cases this happens immediately after the top half finishes. The bottom half, however, runs as a regular kernel thread: it can be interrupted at any time, and is therefore allowed to execute for as long as it needs.

As an example, consider the interrupt handler of a network card, which reports that an Ethernet packet has been received. This handler must do two things:

    Take the packet from the network card's buffer and signal to the network card that the packet has been received by the operating system. This must be done immediately upon receiving the interrupt: a millisecond later the buffer will hold completely different data;

    Place this packet in some kernel structure, determine which protocol it belongs to, and pass it to the appropriate processing functions. This should be done as quickly as possible for maximum network subsystem performance, but not necessarily immediately.

Accordingly, the first task is performed by the top half of the handler, and the second by the bottom half.
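Schematically, for a hypothetical device (my sketch, not driver-quality code), the division looks like this: the top half registered via request_irq() grabs the data, acknowledges the hardware, and defers the rest to a work item, which ties this back to the workqueue mechanism described earlier:

#include <linux/interrupt.h>
#include <linux/workqueue.h>

static struct work_struct rx_work;  /* the bottom half */

static void rx_bottom_half(struct work_struct *work)
{
    /* Runs as a regular kernel thread: may sleep and take its time.
       Here the packet would be parsed and handed to the protocol stack. */
}

/* The top half: must be short, since the scheduler is inactive while it runs */
static irqreturn_t rx_top_half(int irq, void *dev_id)
{
    /* 1. Immediately take the packet out of the card's buffer and
          acknowledge the interrupt to the hardware (omitted here). */
    /* 2. Defer the rest of the processing: */
    schedule_work(&rx_work);
    return IRQ_HANDLED;
}

static int hypothetical_probe(int irq, void *dev)
{
    INIT_WORK(&rx_work, rx_bottom_half);
    /* Register rx_top_half as the handler for this IRQ line */
    return request_irq(irq, rx_top_half, IRQF_SHARED, "mydev", dev);
}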

Device Drivers

Most device drivers are compiled as kernel modules. A device driver receives requests from two sides:

    From the device - through interrupt handlers registered by the driver;

    From various parts of the kernel - through an API, which is determined by a specific kernel subsystem and the driver itself.