5 changes: 3 additions & 2 deletions Cargo.lock

18 changes: 18 additions & 0 deletions FAQ.md
Expand Up @@ -4,6 +4,8 @@

- [Why is my core dump truncated?](#why-is-my-core-dump-truncated)

- [Why is my zip file corrupted?](#why-is-my-zip-file-corrupted)

- [Why is my log file exactly half of my configured line count?](#why-is-my-log-file-exactly-half-of-my-configured-line-count)

- [Can I force an upload?](#can-i-force-an-upload)
Expand All @@ -12,6 +14,8 @@

- [How do I use the custom endpoint?](#how-do-i-use-the-custom-endpoint)

- [Why am I getting the wrong container info?](#why-am-i-getting-the-wrong-container-info)

## How should I integrate my own uploader?

The core dump handler is designed to quickly move the cores *"off-box"* to an object storage environment with as much additional runtime information as possible.
Expand Down Expand Up @@ -73,6 +77,14 @@ terminationGracePeriodSeconds: 120
```
Also see [Kubernetes best practices: terminating with grace](https://cloud.google.com/blog/products/containers-kubernetes/kubernetes-best-practices-terminating-with-grace)

## Why is my zip file corrupted?

As of v8.7.0 there is a timer on the core dump process to prevent repeated hanging core dumps from taking down the system.
For very large core dumps this means the process can be cut short, leaving the zip file incomplete.

In v8.8.0 we added an option to disable compression in the zip process to improve performance, and you can increase the timeout, which defaults to 10 minutes.
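
The truncation follows from how a timer of this kind typically works: the work runs on its own thread while the caller waits on a channel for a bounded time. The following is a minimal sketch of that pattern using only the Rust standard library. `run_with_timeout` and the sleep durations are hypothetical stand-ins for illustration, not the handler's actual code.

```rust
use std::sync::mpsc::{channel, RecvTimeoutError};
use std::thread;
use std::time::Duration;

/// Hypothetical helper: runs `work` on its own thread and waits at most
/// `timeout` for its result.
fn run_with_timeout<T, F>(work: F, timeout: Duration) -> Result<T, RecvTimeoutError>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (send, recv) = channel();
    thread::spawn(move || {
        // The send fails if the receiver has already given up; ignore that case.
        let _ = send.send(work());
    });
    recv.recv_timeout(timeout)
}

fn main() {
    // Finishes well inside the limit.
    let fast = run_with_timeout(|| 42, Duration::from_secs(5));
    println!("fast: {:?}", fast);

    // Simulates a slow zip step that outlives the limit.
    let slow = run_with_timeout(
        || {
            thread::sleep(Duration::from_millis(500));
            "zip complete"
        },
        Duration::from_millis(50),
    );
    println!("slow: {:?}", slow);
}
```

If the process exits once the wait times out, the worker thread is cut off wherever it happens to be, which is consistent with an incomplete zip file for very large dumps.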


## Why is my log file exactly half of my configured line count?

This appears to be a bug in some Kubernetes services.
Expand Down Expand Up @@ -134,3 +146,9 @@ extraEnvVars: |
- name: S3_ENDPOINT
value: https://the-endpoint
```

## Why am I getting the wrong container info?

Core dump handler tries to find the container information for the crashing process based on the hostname of the pod. This works fine in most scenarios, but it can pick the wrong container when pods are created directly in multiple namespaces or the same StatefulSets are created in the same namespaces.

The current recommendation is to create a unique name in both of those scenarios. [See issue 115](https://github.com/IBM/core-dump-handler/issues/115)
12 changes: 11 additions & 1 deletion charts/core-dump-handler/README.md
Expand Up @@ -182,6 +182,14 @@ The agent pod has the following environment variables and these are all set by t
"img" (Default): This is the value most crictls expect.
"images": Digital Ocean, Newer OpenShift require this value

* COMP_TIMEOUT - The timeout for the composer in seconds. Defaults to 600.

In testing, processing took roughly 3 minutes per 512Mb, so the default is set to 10 minutes.

* COMP_COMPRESSION - Enable compression. Defaults to true.

Given the amount of time compression takes, there is an option to disable it.

* CRIO_ENDPOINT - The CRIO endpoint to use.

"unix:///run/containerd/containerd.sock" (Default): This is the default for most containerd nodes
Expand Down Expand Up @@ -252,7 +260,9 @@ Composer
* logLevel: The log level for the composer (Default "Warn")
* ignoreCrio: Maps to the COMP_IGNORE_CRIO environment variable (Default false)
* crioImageCmd: Maps to the COMP_CRIO_IMAGE_CMD environment variable (Default "img")
* filenameTemplate: Maps to COMP_FILENAME_TEMPLATE environment variable
* timeout: Maps to the COMP_TIMEOUT environment variable (Default 600)
* compression: Maps to the COMP_COMPRESSION environment variable (Default "true")
* filenameTemplate: Maps to COMP_FILENAME_TEMPLATE environment variable
(Default {{uuid}}-dump-{{timestamp}}-{{hostname}}-{{exe_name}}-{{pid}}-{{signal}})

Possible Values:
Expand Down
4 changes: 4 additions & 0 deletions charts/core-dump-handler/templates/daemonset.yaml
Expand Up @@ -47,6 +47,10 @@ spec:
value: {{ .Values.composer.crioImageCmd }}
- name: COMP_POD_SELECTOR_LABEL
value: {{ .Values.composer.podSelectorLabel }}
- name: COMP_TIMEOUT
value: {{ .Values.composer.timeout | quote }}
- name: COMP_COMPRESSION
value: {{ .Values.composer.compression | quote }}
- name: DEPLOY_CRIO_CONFIG
value: {{ .Values.daemonset.deployCrioConfig | quote }}
- name: CRIO_ENDPOINT
Expand Down
15 changes: 12 additions & 3 deletions charts/core-dump-handler/values.schema.json
Expand Up @@ -115,14 +115,23 @@
},
"podSelectorLabel": {
"type": "string"
},
"timeout": {
"type": "integer",
"minimum": 120
},
"compression": {
"type": "boolean"
}
},
"required": [
"crioImageCmd",
"ignoreCrio",
"logLevel",
"logLength",
"filenameTemplate"
"filenameTemplate",
"timeout",
"compression"
],
"title": "Composer"
},
Expand Down Expand Up @@ -183,7 +192,7 @@
"hostContainerRuntimeEndpoint"
]
}
}
}
],
"properties": {
"name": {
Expand Down Expand Up @@ -316,4 +325,4 @@
"title": "ServiceAccount"
}
}
}
}
4 changes: 3 additions & 1 deletion charts/core-dump-handler/values.yaml
Expand Up @@ -3,7 +3,7 @@ replicaCount: 1
image:
registry: quay.io
repository: icdh/core-dump-handler
tag: v8.7.0
tag: schema-updates
pullPolicy: Always
pullSecrets: []
request_mem: "64Mi"
Expand All @@ -27,6 +27,8 @@ composer:
filenameTemplate: "{uuid}-dump-{timestamp}-{hostname}-{exe_name}-{pid}-{signal}"
logLength: 500
podSelectorLabel: ""
timeout: 600
compression: true

daemonset:
name: "core-dump-handler"
Expand Down
16 changes: 10 additions & 6 deletions core-dump-agent/src/main.rs
Expand Up @@ -288,7 +288,7 @@ async fn main() -> Result<(), anyhow::Error> {
async fn process_file(zip_path: &Path, bucket: &Bucket) {
info!("Uploading: {}", zip_path.display());

let f = File::open(&zip_path).expect("no file found");
let f = File::open(zip_path).expect("no file found");

match f.try_lock(FileLockMode::Shared) {
Ok(_) => { /* If we can lock then we are ok */ }
Expand All @@ -305,7 +305,7 @@ async fn process_file(zip_path: &Path, bucket: &Bucket) {
}
}

let metadata = fs::metadata(&zip_path).expect("unable to read metadata");
let metadata = fs::metadata(zip_path).expect("unable to read metadata");
info!("zip size is {}", metadata.len());
let path_str = match zip_path.to_str() {
Some(v) => v,
Expand Down Expand Up @@ -473,11 +473,15 @@ fn create_env_file(host_location: &str) -> Result<(), std::io::Error> {
});
let log_length = env::var("LOG_LENGTH").unwrap_or_else(|_| "500".to_string());
let pod_selector_label = env::var("COMP_POD_SELECTOR_LABEL").unwrap_or_default();
let timeout = env::var("COMP_TIMEOUT").unwrap_or_else(|_| "600".to_string());
let compression = env::var("COMP_COMPRESSION")
.unwrap_or_else(|_| "true".to_string())
.to_lowercase();
info!("Creating {} file with LOG_LEVEL={}", destination, loglevel);
let mut env_file = File::create(destination)?;
let text = format!(
"LOG_LEVEL={}\nIGNORE_CRIO={}\nCRIO_IMAGE_CMD={}\nUSE_CRIO_CONF={}\nFILENAME_TEMPLATE={}\nLOG_LENGTH={}\nPOD_SELECTOR_LABEL={}\n",
loglevel, ignore_crio, crio_image, use_crio_config, filename_template, log_length, pod_selector_label
"LOG_LEVEL={}\nIGNORE_CRIO={}\nCRIO_IMAGE_CMD={}\nUSE_CRIO_CONF={}\nFILENAME_TEMPLATE={}\nLOG_LENGTH={}\nPOD_SELECTOR_LABEL={}\nTIMEOUT={}\nCOMPRESSION={}\n",
loglevel, ignore_crio, crio_image, use_crio_config, filename_template, log_length, pod_selector_label, timeout, compression
);
info!("Writing composer .env \n{}", text);
env_file.write_all(text.as_bytes())?;
Expand All @@ -496,7 +500,7 @@ fn get_sysctl(name: &str) -> Result<String, anyhow::Error> {
info!("Getting sysctl for {}", name);
let output = Command::new("sysctl")
.env("PATH", get_path())
.args(&["-n", name])
.args(["-n", name])
.output()?;
let lines = String::from_utf8(output.stdout)?;
let line = lines.lines().take(1).next().unwrap_or("");
Expand All @@ -522,7 +526,7 @@ fn overwrite_sysctl(name: &str, value: &str) -> Result<(), anyhow::Error> {
let s = format!("{}={}", name, value);
let output = Command::new("sysctl")
.env("PATH", get_path())
.args(&["-w", s.as_str()])
.args(["-w", s.as_str()])
.status()?;
if !output.success() {
let e = Error::InvalidOverWrite {
Expand Down
2 changes: 1 addition & 1 deletion core-dump-agent/tests/basic.rs
Expand Up @@ -95,7 +95,7 @@ fn basic() -> Result<(), std::io::Error> {
"FILENAME_TEMPLATE={uuid}-dump-{timestamp}-{hostname}-{exe_name}-{pid}-{signal}"
));
assert!(env_content.contains("LOG_LENGTH=500"));
assert_eq!(env_content.lines().count(), 7);
assert_eq!(env_content.lines().count(), 9);
//TODO: [No9] Test uploading of a corefile
//TODO: [No9] Test remove option
//TODO: [No9] Test sweep option
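
The expected line count moves from 7 to 9 because `create_env_file` now appends `TIMEOUT` and `COMPRESSION` to the generated `.env` text. A minimal sketch of that relationship, assuming one `KEY=value` line per setting. `build_env_text` is a hypothetical helper and the values are illustrative defaults, not output captured from the agent.

```rust
/// Hypothetical helper: render key/value pairs as one `KEY=value` line each.
fn build_env_text(pairs: &[(&str, &str)]) -> String {
    pairs
        .iter()
        .map(|(k, v)| format!("{}={}\n", k, v))
        .collect()
}

fn main() {
    // Illustrative values only; the agent reads the real ones from its environment.
    let text = build_env_text(&[
        ("LOG_LEVEL", "Warn"),
        ("IGNORE_CRIO", "false"),
        ("CRIO_IMAGE_CMD", "img"),
        ("USE_CRIO_CONF", "false"),
        ("FILENAME_TEMPLATE", "{uuid}-dump-{timestamp}-{hostname}-{exe_name}-{pid}-{signal}"),
        ("LOG_LENGTH", "500"),
        ("POD_SELECTOR_LABEL", ""),
        ("TIMEOUT", "600"),
        ("COMPRESSION", "true"),
    ]);
    // Every pair contributes exactly one line, including the empty label.
    println!("{} lines", text.lines().count());
}
```

An empty value such as `POD_SELECTOR_LABEL=` still produces a line, so the count tracks the number of settings rather than the number of non-empty values.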
Expand Down
2 changes: 1 addition & 1 deletion core-dump-composer/Cargo.toml
Expand Up @@ -13,7 +13,7 @@ uuid = { version = "1.1.0", features = ["serde", "v4"] }
zip = "0.6.2"
dotenv = "0.15.0"
log = "0.4.14"
log4rs = { git = "https://github.com/No9/log4rs/", branch = "typemap-ors-fix" }
log4rs = "1.2.0"
anyhow = "1.0.53"
serde_json = "1.0.76"
serde = { version = "1.0.134", features = ["derive"] }
Expand Down
28 changes: 18 additions & 10 deletions core-dump-composer/src/config.rs
Expand Up @@ -21,12 +21,13 @@ pub struct CoreConfig {
pub pod_selector_label: String,
pub use_crio_config: bool,
pub ignore_crio: bool,
pub timeout: u32,
pub compression: bool,
pub image_command: ImageCommand,
pub bin_path: String,
pub os_hostname: String,
pub filename_template: String,
pub params: CoreParams,
pub disable_compression: bool,
}

#[derive(Serialize)]
Expand All @@ -39,7 +40,6 @@ pub struct CoreParams {
pub directory: String,
pub hostname: String,
pub pathname: String,
pub timeout: u64,
pub namespace: Option<String>,
pub podname: Option<String>,
pub uuid: Uuid,
Expand All @@ -58,12 +58,12 @@ impl CoreConfig {
let directory = matches.value_of("directory").unwrap_or("").to_string();
let hostname = matches.value_of("hostname").unwrap_or("").to_string();
let pathname = matches.value_of("pathname").unwrap_or("").to_string();
let timeout = matches
.value_of("timeout")
.unwrap_or("600")
.parse::<u64>()
.unwrap();
let disable_compression = matches.contains_id("disable-compression");
// let timeout = matches
// .value_of("timeout")
// .unwrap_or("600")
// .parse::<u64>()
// .unwrap();
// let disable_compression = matches.contains_id("disable-compression");

let uuid = Uuid::new_v4();

Expand All @@ -76,7 +76,6 @@ impl CoreConfig {
directory,
hostname,
pathname,
timeout,
namespace: None,
podname: None,
uuid,
Expand Down Expand Up @@ -112,6 +111,14 @@ impl CoreConfig {
.unwrap_or_else(|_| "false".to_string().to_lowercase())
.parse::<bool>()
.unwrap();
let compression = env::var("COMPRESSION")
.unwrap_or_else(|_| "true".to_string().to_lowercase())
.parse::<bool>()
.unwrap();
let timeout = env::var("TIMEOUT")
.unwrap_or_else(|_| "600".to_string())
.parse::<u32>()
.unwrap();
let os_hostname = hostname::get()
.unwrap_or_else(|_| OsString::from_str("unknown").unwrap_or_default())
.into_string()
Expand Down Expand Up @@ -146,7 +153,8 @@ impl CoreConfig {
filename_template,
log_length,
params,
disable_compression,
compression,
timeout,
})
}
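
The `COMPRESSION` and `TIMEOUT` lookups above call `unwrap()` on the parse result, so a malformed value would panic, and only the default string is lowercased before parsing. One possible alternative, sketched here with the standard library only, falls back to the defaults instead. `parse_timeout` and `parse_compression` are hypothetical helpers that do not exist in this PR.

```rust
/// Hypothetical helper: parse a timeout in seconds, falling back to 600.
fn parse_timeout(raw: Option<&str>) -> u32 {
    raw.and_then(|v| v.trim().parse::<u32>().ok()).unwrap_or(600)
}

/// Hypothetical helper: parse a boolean flag case-insensitively, defaulting to true.
fn parse_compression(raw: Option<&str>) -> bool {
    raw.and_then(|v| v.trim().to_lowercase().parse::<bool>().ok())
        .unwrap_or(true)
}

fn main() {
    // Read the same variable names the composer uses; unset or invalid values
    // fall back to the defaults rather than panicking.
    let timeout = parse_timeout(std::env::var("TIMEOUT").ok().as_deref());
    let compression = parse_compression(std::env::var("COMPRESSION").ok().as_deref());
    println!("timeout={} compression={}", timeout, compression);
}
```

Whether a silent fallback is preferable to a panic is a design choice: falling back keeps the composer running on a bad value, while failing fast makes a misconfiguration visible sooner.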

Expand Down
11 changes: 5 additions & 6 deletions core-dump-composer/src/main.rs
Expand Up @@ -21,14 +21,13 @@ mod logging;
fn main() -> Result<(), anyhow::Error> {
let (send, recv) = channel();
let cc = config::CoreConfig::new()?;
let timeout = cc.params.timeout;

let recv_time: u64 = cc.timeout as u64;
thread::spawn(move || {
let result = handle(cc);
send.send(result).unwrap();
});

let result = recv.recv_timeout(Duration::from_secs(timeout));
let result = recv.recv_timeout(Duration::from_secs(recv_time));

match result {
Ok(inner_result) => inner_result,
Expand Down Expand Up @@ -111,10 +110,10 @@ fn handle(mut cc: config::CoreConfig) -> Result<(), anyhow::Error> {
cc.set_podname(podname.to_string());

// Create the base zip file that we are going to put everything into
let compression_method = if cc.disable_compression {
zip::CompressionMethod::Stored
} else {
let compression_method = if cc.compression {
zip::CompressionMethod::Deflated
} else {
zip::CompressionMethod::Stored
};
let options = FileOptions::default()
.compression_method(compression_method)
Expand Down
3 changes: 1 addition & 2 deletions core-dump-composer/tests/timeout.rs
Expand Up @@ -44,6 +44,7 @@ fn timeout_scenario() -> Result<(), std::io::Error> {
.unwrap();

let cdc = Command::new("../target/debug/core-dump-composer")
.env("TIMEOUT", "1")
.arg("-c")
.arg("1000000000")
.arg("-e")
Expand All @@ -60,8 +61,6 @@ fn timeout_scenario() -> Result<(), std::io::Error> {
.arg("1588462466")
.arg("-h")
.arg("crashing-app-699c49b4ff-86wrh")
.arg("--timeout")
.arg("1")
.stdin(cat)
.output()
.expect("Couldn't execute");
Expand Down