Linux Internals: May 2014

Monday, May 12, 2014

All About Ftrace

Ftrace is the Linux kernel's function tracer. It is controlled through files in the debugfs file system.

To use Ftrace, the following kernel configuration options need to be enabled if they are not already:

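CONFIG_FTRACE,
CONFIG_FUNCTION_TRACER,
CONFIG_FUNCTION_GRAPH_TRACER,
CONFIG_DYNAMIC_FTRACE,
CONFIG_STACKTRACE

(The matching CONFIG_HAVE_* symbols are set by the architecture and only indicate that these features are supported.)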
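One way to check such options on a running system is to grep the kernel's config file. The helper below is only a sketch: check_kernel_config is a hypothetical name, not part of ftrace, and the location of the config file (often /boot/config-$(uname -r)) is an assumption that depends on the distribution.

```shell
#!/bin/sh
# check_kernel_config FILE OPTION...
# Prints "OPTION: enabled" or "OPTION: missing" for each option and
# returns non-zero if any option is not set to y in FILE.
check_kernel_config() {
    cfg_file=$1
    shift
    cfg_missing=0
    for cfg_opt in "$@"; do
        if grep -q "^${cfg_opt}=y\$" "$cfg_file"; then
            echo "$cfg_opt: enabled"
        else
            echo "$cfg_opt: missing"
            cfg_missing=1
        fi
    done
    return "$cfg_missing"
}
```

For example: check_kernel_config "/boot/config-$(uname -r)" CONFIG_FTRACE CONFIG_STACKTRACE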

To mount debugfs:

mount -t debugfs nodev /sys/kernel/debug

If debugfs is already mounted at /sys/kernel/debug, there is no need to run the above command again.

The ftrace control directory then appears at /sys/kernel/debug/tracing.

This directory contains all the control files for kernel tracing.
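To find out whether debugfs is already mounted, and where, the mount table can be inspected. The following is a minimal sketch; debugfs_mountpoint is a hypothetical helper name, and it assumes the usual /proc/mounts layout (device, mount point, file system type, ...).

```shell
#!/bin/sh
# debugfs_mountpoint [MOUNTS_FILE]
# Prints the first mount point whose file system type is debugfs.
# Returns non-zero if debugfs is not mounted.
debugfs_mountpoint() {
    awk '$3 == "debugfs" { print $2; found = 1; exit }
         END { exit !found }' "${1:-/proc/mounts}"
}
```

With debugfs mounted, ls "$(debugfs_mountpoint)/tracing" lists the control files.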

Some of the important files to refer to when using ftrace:
1) available_filter_functions: lists all the functions ftrace is able to trace.
2) available_tracers: lists all the tracers compiled into the kernel.
3) current_tracer: shows the currently selected tracer.
4) trace: holds the output of what is being traced, in human-readable format.
5) trace_options: controls what is shown in the trace output.
                 To set an option, write its name, e.g. echo block > trace_options
                 To clear it, prefix the name with "no", e.g. echo noblock > trace_options

6) tracing_on: starts or stops recording into the trace buffer (older kernels also provided tracing_enabled).
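As a small illustration of reading these control files, the sketch below prints every tracer built into the kernel and marks the one currently selected. list_tracers is a hypothetical helper; the default directory is the one used in this post, and the file names are those found in the tracing directory (available_tracers, current_tracer).

```shell
#!/bin/sh
# list_tracers [TRACING_DIR]
# Prints one tracer per line, with "*" in front of the current one.
list_tracers() {
    lt_dir=${1:-/sys/kernel/debug/tracing}
    lt_current=$(cat "$lt_dir/current_tracer")
    # available_tracers is a single space-separated line
    for lt_name in $(cat "$lt_dir/available_tracers"); do
        if [ "$lt_name" = "$lt_current" ]; then
            echo "* $lt_name"
        else
            echo "  $lt_name"
        fi
    done
}
```

Reading the real files usually requires root.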

To select the function tracer:

echo function > /sys/kernel/debug/tracing/current_tracer

To enable/disable tracing (paths are relative to /sys/kernel/debug/tracing)

echo 1 > tracing_on   (enable tracing)
echo 0 > tracing_on   (disable tracing)
echo > trace          (clear the trace log)

echo nop > current_tracer   (deselect the tracer)

To trace or monitor block I/O
echo 1 > events/block/enable (enable all events of the block I/O subsystem)

cat set_event (display all the events currently enabled)

echo 1 > tracing_on
run your program
echo 0 > tracing_on

cat trace (display the collected trace)
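The enable / run / disable sequence above can be wrapped in a small function so that tracing is switched off again even when the program fails. This is only a sketch: trace_command is a hypothetical helper, and the tracing directory is passed in explicitly rather than assumed.

```shell
#!/bin/sh
# trace_command TRACING_DIR COMMAND [ARGS...]
# Turns tracing on, runs the command, turns tracing off again and
# returns the command's exit status.
trace_command() {
    tc_dir=$1
    shift
    echo 1 > "$tc_dir/tracing_on"
    tc_status=0
    "$@" || tc_status=$?
    echo 0 > "$tc_dir/tracing_on"
    return "$tc_status"
}
```

For example, as root: trace_command /sys/kernel/debug/tracing dd if=/dev/zero of=/tmp/test bs=4k count=100, followed by cat /sys/kernel/debug/tracing/trace.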

Example of tracing a specific process

traceme.sh

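#!/bin/sh
# Locate the debugfs mount point (first debugfs entry in /proc/mounts).
DEBUGFS=$(awk '$3 == "debugfs" { print $2; exit }' /proc/mounts)
echo nop > "$DEBUGFS/tracing/current_tracer"
echo > "$DEBUGFS/tracing/trace"
echo "$1"
# Trace only this PID; the exec below keeps the same PID for the command.
echo $$ > "$DEBUGFS/tracing/set_ftrace_pid"
echo function > "$DEBUGFS/tracing/current_tracer"
#echo function_graph > "$DEBUGFS/tracing/current_tracer"
echo 1 > "$DEBUGFS/tracing/tracing_on"
# exec replaces this shell, so no line after it would ever run.
exec "$@"

Since exec never returns to the script, stop tracing by hand once the command has finished:

echo 0 > /sys/kernel/debug/tracing/tracing_on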

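To restrict tracing to particular functions, set a filter before running the script (quote the patterns so the shell does not expand them):

echo 'sys_*' > set_ftrace_filter
echo 'vfs_*' >> set_ftrace_filter

Usage:

traceme.sh ls -al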
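Before writing a pattern to set_ftrace_filter it can be useful to see which functions it would select. The sketch below approximates this with ordinary shell globbing over available_filter_functions; matching_functions is a hypothetical helper, and the kernel's own matching may differ in detail from shell patterns.

```shell
#!/bin/sh
# matching_functions PATTERN [FILE]
# Prints the function names in FILE that match the shell pattern.
# Lines may look like "name" or "name [module]"; only the name is used.
matching_functions() {
    mf_pattern=$1
    mf_file=${2:-/sys/kernel/debug/tracing/available_filter_functions}
    while read -r mf_name mf_rest; do
        case $mf_name in
            $mf_pattern) echo "$mf_name" ;;
        esac
    done < "$mf_file"
}
```

For example: matching_functions 'vfs_*' | wc -l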

Thursday, May 8, 2014

All About NVME

NVMe stands for Non-Volatile Memory Express. It is a host controller interface for SSDs attached over PCIe, designed for low-latency response.

The architecture of NVMe on Linux looks like this:

The NVMe controller uses PCI BAR0 and BAR1 to map its internal control registers into the host's memory space.

The NVMe host controller interface (HCI) model is built around Submission Queues, Completion Queues and Doorbell registers.

There are two types of queues
1) Admin Queues
2) I/O Queues

Host software creates the Admin Queue first (Admin Queue structure initialization, etc.).

The host then uses admin commands (submitted to the Admin Queue) to create I/O queue pairs (a Submission Queue and a Completion Queue).

Below is the layout of the NVMe controller registers. The host writes the Admin SQ base address (ASQ, offset 0x28) and the Admin CQ base address (ACQ, offset 0x30) through the memory-mapped register space.

Important Registers
Admin Queue Attributes (AQA, offset 0x24): Admin SQ size (ASQS) and Admin CQ size (ACQS).

Before any admin command can be submitted, the host sets the queue sizes in AQA and programs the base addresses of the queues into ASQ and ACQ.
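To make the AQA register concrete, the sketch below packs the two admin queue sizes into one 32-bit value. It assumes the layout given in the NVMe specification as I understand it: the Admin SQ size in bits 11:0, the Admin CQ size in bits 27:16, both stored as "number of entries minus one", with 2 to 4096 entries allowed. aqa_value is a hypothetical helper name.

```shell
#!/bin/sh
# aqa_value ASQ_ENTRIES ACQ_ENTRIES
# Prints the 32-bit AQA register value for the given queue sizes.
aqa_value() {
    aqa_asq=$1
    aqa_acq=$2
    for aqa_n in "$aqa_asq" "$aqa_acq"; do
        if [ "$aqa_n" -lt 2 ] || [ "$aqa_n" -gt 4096 ]; then
            echo "queue size must be between 2 and 4096 entries" >&2
            return 1
        fi
    done
    # Sizes are 0's based: a 32-entry queue is written as 31.
    printf '0x%08x\n' $(( ((aqa_acq - 1) << 16) | (aqa_asq - 1) ))
}
```

For example, aqa_value 32 32 describes two 32-entry admin queues.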

The host creates I/O Submission and Completion Queues by placing admin commands in the newly created Admin Queue.

Some of the Admin Commands are
1) Delete IO SQ
2) Create IO SQ
3) Create IO CQ
4) Delete IO CQ
5) Identify
6) Firmware Activate/Image Download

Multiple I/O Submission Queues are possible, which allows
1) Load distribution across CPU cores
2) One CQ serving multiple SQs
3) Avoiding locking overhead
4) Queue priorities

Once a Submission Queue is created, the host can submit I/O commands.

Supported I/O commands include
1) Flush
2) Write
3) Read

To submit an I/O command, the host places a command entry (containing the address of the data buffer) into the Submission Queue and then writes the SQ Tail Doorbell register.

NVMe doorbells follow a producer/consumer model.

Host acts as
1) Producer of commands -> updates the SQ tail pointer
2) Consumer of completions -> updates the CQ head pointer

Controller acts as
1) Consumer of commands -> updates the SQ head pointer
2) Producer of completions -> updates the CQ tail pointer

Let's consider a scenario.

Initial State

SQ1 = { empty }
CQ1 = { empty }
SQ1TailDB = {0}
SQ1HeadDB = {0}
CQ1TailDB = {0}
CQ1HeadDB = {0}

Host adds 3 commands

SQ1 = {CMD0, CMD1, CMD2, ..... };
SQ1TailDB = {3}

Controller fetches the 3 commands
SQ1HeadDB = {3}
SQ1 = {empty} //marked empty

Controller posts completions (let's say it posts 2 completions)
CQ1 = {CMD0, CMD1, empty ......}
CQ1TailDB = {2}

The host is interrupted when CQ1TailDB is updated.
The host reads CQ1 and updates CQ1HeadDB.

CQ1 = {empty}
CQ1HeadDB = {2}
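The pointer movements in this scenario can be replayed with a few lines of shell arithmetic. This is a toy model, not driver code: the queue depth of 8 and all function names are assumptions made for illustration, and the controller is assumed to fetch everything that is pending, as in the scenario.

```shell
#!/bin/sh
QUEUE_SIZE=8            # assumed queue depth for the illustration
sq_tail=0; sq_head=0    # submission queue pointers
cq_tail=0; cq_head=0    # completion queue pointers

# ring_advance POINTER COUNT: pointer moved forward, wrapping at QUEUE_SIZE
ring_advance() { echo $(( ($1 + $2) % QUEUE_SIZE )); }

# ring_pending HEAD TAIL: entries produced but not yet consumed
ring_pending() { echo $(( ($2 - $1 + QUEUE_SIZE) % QUEUE_SIZE )); }

# Host produces N commands and rings the SQ tail doorbell.
host_submit() { sq_tail=$(ring_advance "$sq_tail" "$1"); }

# Controller consumes every pending command.
controller_fetch() { sq_head=$sq_tail; }

# Controller produces N completions.
controller_complete() { cq_tail=$(ring_advance "$cq_tail" "$1"); }

# Host consumes every pending completion and updates the CQ head.
host_reap() { cq_head=$cq_tail; }

show_state() {
    echo "SQ head=$sq_head tail=$sq_tail  CQ head=$cq_head tail=$cq_tail"
}
```

Calling host_submit 3, controller_fetch, controller_complete 2 and host_reap in that order, with show_state after each step, walks through the same states as the scenario. The modulo arithmetic is what lets the pointers wrap around once they reach the end of the queue.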

Each command submitted to an SQ is 64 bytes in size. Command DW0, NSID, Metadata Pointer, PRP Entry 1 and PRP Entry 2 have common definitions for all Admin commands and NVM commands.

The Command DW0 format is defined in the figure below.
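The 64-byte size can be checked by adding up the fields. The sketch below prints the byte offset and size of each field; the field sizes follow the common command format of the NVMe 1.x specification as I understand it (DW2-DW3 reserved, DW10-DW15 command specific), and nvme_cmd_layout is a hypothetical helper name.

```shell
#!/bin/sh
# nvme_cmd_layout
# Prints "offset  size  field" for each field of a submission queue
# entry, followed by the total size in bytes.
nvme_cmd_layout() {
    ncl_offset=0
    while read -r ncl_size ncl_name; do
        printf '%2d  %2d  %s\n' "$ncl_offset" "$ncl_size" "$ncl_name"
        ncl_offset=$((ncl_offset + ncl_size))
    done <<'EOF'
4 Command DW0
4 NSID
8 Reserved (DW2-DW3)
8 Metadata Pointer
8 PRP Entry 1
8 PRP Entry 2
24 Command specific (DW10-DW15)
EOF
    printf '%2d  total bytes\n' "$ncl_offset"
}
```

Running nvme_cmd_layout lists the seven fields and ends with the total size of the entry.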